I need to find all sub-arrays which share any mutual element and merge them into one sub-array. (Implementing in Python but any algorithmic idea would be helpful)
Multidimensional array structure:
categories = {'car': ['automobile', 'auto'],
              'bike': ['vehicle', 'motorcycle', 'motorbike', 'automobile'],
              'software': ['computer', 'macbook', 'apple', 'microsoft', 'mozilla'],
              'firefox': ['internet', 'mozilla', 'browser'],
              'bicycle': ['vehicle']}
I'd like to have 'car', 'bike' and 'bicycle' merged into one list (the new list's key could be any of the relevant keys) and 'software' and 'firefox' merged into one list as well.
Performance is crucial.
Best solution I could come up with so far is to maintain a flat, one-dimensional mapping of element => list_key (e.g. 'automobile' => 'car') and then run the following recursive function for each list in the multidimensional array (pseudocode):

function merge_similar(list_key):
    for each element in categories[list_key]:
        if flatten_array.has_key(element):
            list_to_merge = flatten_array[element]
            /* merge other lists which share an element with our newly found similar list */
            merge_similar(list_to_merge)
            categories[list_key] = merge(categories[list_key], categories[list_to_merge])
            delete categories[list_to_merge]
Any idea how to improve its performance?
Thanks!
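In runnable form, the idea looks roughly like this (the flatten mapping is built as lists are visited, and the merge() helper from the pseudocode is replaced by a duplicate-skipping extend):

```python
# Runnable sketch of the pseudocode above. `flatten` plays the role of
# flatten_array (element -> key of the list that currently owns it).
categories = {'car': ['automobile', 'auto'],
              'bike': ['vehicle', 'motorcycle', 'motorbike', 'automobile'],
              'software': ['computer', 'macbook', 'apple', 'microsoft', 'mozilla'],
              'firefox': ['internet', 'mozilla', 'browser'],
              'bicycle': ['vehicle']}

flatten = {}

def merge_similar(list_key):
    for element in list(categories[list_key]):  # copy: the list grows while merging
        owner = flatten.get(element)
        if owner is not None and owner != list_key and owner in categories:
            # merge the other list which shares an element with ours
            merge_similar(owner)
            categories[list_key].extend(
                x for x in categories[owner] if x not in categories[list_key])
            del categories[owner]
            for x in categories[list_key]:
                flatten[x] = list_key
        else:
            flatten[element] = list_key

for key in list(categories):
    if key in categories:  # keys can disappear as lists get merged
        merge_similar(key)

print(categories)
```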
Note that there is no "first key" -- dicts don't keep order (at least not before CPython 3.7, which guarantees insertion order), so if you need some order preserved you'll need to start from a different data structure.
Apart from order-related issues, I'd start with something like:
def merged(dictoflists):
    result = {}
    rev = {}  # maps each element to the key of the merged set it ended up in
    for k, l in dictoflists.items():
        # keys of existing merged sets that share an element with this list
        intersecting = {rev.get(w) for w in l} - {None}
        if intersecting:
            pickone = intersecting.pop()
            into = result[pickone]
        else:
            pickone = k
            into = result[k] = set()
        for ok in intersecting:
            into.update(result.pop(ok))
        into.update(l)
        for w in into:
            rev[w] = pickone
    return {k: sorted(l) for k, l in result.items()}
If order is important to you, the uses of set will be problematic and you'll need more complicated (and slower) data structures -- however, if that's the case, you should first specify in complete detail exactly what ordering constraints you need to respect in the various possible cases that can occur.
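A quick usage sketch against the question's data (a lightly modernized Python 3 copy of the merged function, repeated here so the snippet runs on its own):

```python
def merged(dictoflists):
    result = {}
    rev = {}  # element -> key of the merged set it currently lives in
    for k, l in dictoflists.items():
        intersecting = {rev.get(w) for w in l} - {None}
        if intersecting:
            pickone = intersecting.pop()
            into = result[pickone]
        else:
            pickone = k
            into = result[k] = set()
        for ok in intersecting:
            into.update(result.pop(ok))
        into.update(l)
        for w in into:
            rev[w] = pickone
    return {k: sorted(l) for k, l in result.items()}

categories = {'car': ['automobile', 'auto'],
              'bike': ['vehicle', 'motorcycle', 'motorbike', 'automobile'],
              'software': ['computer', 'macbook', 'apple', 'microsoft', 'mozilla'],
              'firefox': ['internet', 'mozilla', 'browser'],
              'bicycle': ['vehicle']}

out = merged(categories)
print(out)  # two groups: car/bike/bicycle words and software/firefox words
```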
I can't imagine that a recursive solution would be speedy.
Is using list.extend() too slow?
You could do something like this:
categories['car'].extend(categories['bike'])
categories['car'].extend(categories['bicycle'])
Or to be more general, if you pass in a list of keys you want to merge:
first_key = None
for key in keys_whose_lists_I_want_to_merge:
    if first_key is None:
        first_key = key
    else:
        categories[first_key].extend(categories[key])
If you're merging a ton of lists, you can optimize that loop to not perform the None check after the first time. See the tip entitled 'Re-map Functions at runtime' on the Python Performance Tips page.
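One way to drop the per-iteration None check is to pull the first key out of an iterator before the loop starts (a sketch; `keys_to_merge` is a hypothetical stand-in for whatever key list you pass in):

```python
categories = {'car': ['automobile', 'auto'],
              'bike': ['vehicle', 'motorcycle', 'motorbike', 'automobile'],
              'bicycle': ['vehicle']}
keys_to_merge = ['car', 'bike', 'bicycle']  # hypothetical input

# consume the first key up front so the loop body is a bare extend()
keys = iter(keys_to_merge)
first_key = next(keys)
for key in keys:
    categories[first_key].extend(categories[key])

print(categories[first_key])  # note: duplicates are kept
```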
>>> categories = {'car': ['automobile', 'auto'],
...               'bike': ['vehicle', 'motorcycle', 'motorbike', 'automobile'],
...               'software': ['computer', 'macbook', 'apple', 'microsoft', 'mozilla'],
...               'firefox': ['internet', 'mozilla', 'browser'],
...               'bicycle': ['vehicle']}
>>> # Use sets for values
>>> for k, v in categories.items(): categories[k] = set(v)
...
>>> # Accumulate
>>> for k1, v1 in categories.items():
...     if v1:
...         for k2, v2 in categories.items():
...             if v2 and k1 != k2 and v1 & v2:
...                 v1 |= v2
...                 categories[k2] = None
...         categories[k1] = v1
...
>>> # Print
>>> for k1, v1 in categories.items():
...     if v1: print('%s: %r' % (k1, v1))
...
bicycle: {'motorbike', 'vehicle', 'auto', 'automobile', 'motorcycle'}
firefox: {'apple', 'mozilla', 'macbook', 'computer', 'internet', 'microsoft', 'browser'}
>>>
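Note that a single pass like this can miss merges that only become visible after a later union (whether it does depends on the dict's iteration order), and it is O(n²) in the number of keys. A sketch of the same merge built on a small union-find (disjoint-set) structure instead, which handles transitive overlaps with one linear scan of the elements:

```python
def merge_categories(categories):
    # union-find over the category keys
    parent = {k: k for k in categories}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # any two keys sharing an element get unioned
    owner = {}  # element -> some key that contains it
    for k, words in categories.items():
        for w in words:
            if w in owner:
                union(k, owner[w])
            else:
                owner[w] = k

    # collect the merged groups under their root keys
    merged = {}
    for k, words in categories.items():
        merged.setdefault(find(k), set()).update(words)
    return merged

categories = {'car': ['automobile', 'auto'],
              'bike': ['vehicle', 'motorcycle', 'motorbike', 'automobile'],
              'software': ['computer', 'macbook', 'apple', 'microsoft', 'mozilla'],
              'firefox': ['internet', 'mozilla', 'browser'],
              'bicycle': ['vehicle']}

result = merge_categories(categories)
for k, v in result.items():
    print(k, sorted(v))
```

Which key survives as the group's label depends on the union order, so if a specific key matters you'd pick it explicitly from each group afterwards.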