user1362058 user1362058 - 1 year ago 78
Python Question

Efficiently check permutations of a small string against a dictionary

I'm trying to see what words can be made with a scrambled string, compared to a dictionary. I got some help with the long string case already, but I think my short string case is what is dragging my program into the 20 seconds range. I am testing with 1000 scrambles and a dictionary of about 170,000 "words".

For the short scrambled word case I thought it would be more efficient to create every permutation of the string and compare that against the dictionary entries, like so:

from itertools import permutations

wordStore = {
8:['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc',
'zsdfvsvv', 'sdffbrfv', 'sdjfjsjf', 'sjnshsnj', 'adhnsrhn', 'sdfbhxdf', 'zsdfgzdf', 'cnzsdfgf', 'sdbdzvff',
'dbgtbzdf', 'zsvrvrdz', 'zdrvrvrn', 'nhcncnby', 'mmmnyndd', 'zswewedf', 'zeswffee', 'sefdedee', 'sefeefee',
'iuygfjhg', 'uytmjnbb', 'uythbgvf', 'ytrgfdcv', 'ytregfcv', 'ytrevcxd', 'ytrevcxs', 'ytrewgfd', 'trewgfds',
'uytrgfdd', 'uytrenhg', 'ytrebgfd', 'jhgfdbvc', 'mnbvyhtr', 'ytrehbgv', 'uytrwwsz', 'mnbtrexx', 'uytrebgv',
'fgfgfvdw', 'werfdcse', 'mnbvcdes', 'kjhgfnbv', 'sdfhgfdw', 'yujhredq', 'wsxrtyhn', 'jfrvsdxw', 'jmrtgedw',
'ujrtgedw', 'ujtgedws', 'yhvedsgy', 'yhygdfex', 'kjjkjuhy', 'rffdddwe', 'esrdtfgd', 'uytrewww', 'vfcdtred',
'kjhgfnbv', 'uytrbvcd', 'jhgfhgfd', 'adfgdfgg', 'mnbvtred', 'jhgfrewb', 'hgfdtred', 'dsfgdfgg', 'dfgdgggg']

scrambles = set([''.join(p) for p in permutations('acowbtec',8)])
for x in scrambles.intersection(wordStore[8]):
print('Found ', x)

I created a small simple set to test against here.

As you can see, it's rather straight forward, but it's way too slow. Here's the relevant cProfile sections from my larger data set test.

ncalls tottime percall cumtime percall filename:lineno(function)
1 9.324 9.324 29.804 29.804<module>)
990 9.053 0.009 16.147 0.016<listcomp>)
990 2.205 0.002 2.205 0.002 {method 'intersection' of 'set' objects}
39916800 7.093 0.000 7.093 0.000 {method 'join' of 'str' objects}

I don't fully understand the cProfile results. It looks like on a per call basis they aren't too slow, but overall they take too much time. Any ideas on how I can speed this up?


With Dan's help I have drastically sped up my program. But I have this initialization that just doesn't seem right. How is it supposed to be done?

with open(file1) as f:
for line in f:
line = line.rstrip()
wordStore[len(line)].setdefault(''.join(sorted(line)), []).append(line)
wordStore[len(line)] = {}
wordStore[len(line)].setdefault(''.join(sorted(line)), []).append(line)

Answer Source

Rather than generating the permutations, search the strings after normalizing them using their sorted order. Start with a linear search and then use the hash index:

>>> eight = ['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc',
...        'zsdfvsvv', 'sdffbrfv', 'sdjfjsjf', 'sjnshsnj', 'adhnsrhn', 'sdfbhxdf', 'zsdfgzdf', 'cnzsdfgf', 'sdbdzvff',
...        'dbgtbzdf', 'zsvrvrdz', 'zdrvrvrn', 'nhcncnby', 'mmmnyndd', 'zswewedf', 'zeswffee', 'sefdedee', 'sefeefee',
...        'iuygfjhg', 'uytmjnbb', 'uythbgvf', 'ytrgfdcv', 'ytregfcv', 'ytrevcxd', 'ytrevcxs', 'ytrewgfd', 'trewgfds',
...        'uytrgfdd', 'uytrenhg', 'ytrebgfd', 'jhgfdbvc', 'mnbvyhtr', 'ytrehbgv', 'uytrwwsz', 'mnbtrexx', 'uytrebgv',
...        'fgfgfvdw', 'werfdcse', 'mnbvcdes', 'kjhgfnbv', 'sdfhgfdw', 'yujhredq', 'wsxrtyhn', 'jfrvsdxw', 'jmrtgedw',
...        'ujrtgedw', 'ujtgedws', 'yhvedsgy', 'yhygdfex', 'kjjkjuhy', 'rffdddwe', 'esrdtfgd', 'uytrewww', 'vfcdtred',
...        'kjhgfnbv', 'uytrbvcd', 'jhgfhgfd', 'adfgdfgg', 'mnbvtred', 'jhgfrewb', 'hgfdtred', 'dsfgdfgg', 'dfgdgggg']

>>> map(lambda s: ''.join(sorted(s)), eight)
['abcceotw', 'abcceotw', 'abcceotw', 'abcceotw', 'abcceotw', 'abcceotw', 'abcceotw', 'abcceotw', 'abcceotw', 'dfssvvvz', 'bdfffrsv', 'dffjjjss', 'hjjnnsss', 'adhhnnrs', 'bddffhsx', 'ddffgszz', 'cdffgnsz', 'bddffsvz', 'bbddfgtz', 'drrsvvzz', 'dnrrrvvz', 'bcchnnny', 'ddmmmnny', 'deefswwz', 'eeeffswz', 'ddeeeefs', 'eeeeeffs', 'fgghijuy', 'bbjmntuy', 'bfghtuvy', 'cdfgrtvy', 'cefgrtvy', 'cdertvxy', 'cerstvxy', 'defgrtwy', 'defgrstw', 'ddfgrtuy', 'eghnrtuy', 'bdefgrty', 'bcdfghjv', 'bhmnrtvy', 'beghrtvy', 'rstuwwyz', 'bemnrtxx', 'begrtuvy', 'dfffggvw', 'cdeefrsw', 'bcdemnsv', 'bfghjknv', 'ddffghsw', 'dehjqruy', 'hnrstwxy', 'dfjrsvwx', 'degjmrtw', 'degjrtuw', 'degjstuw', 'deghsvyy', 'defghxyy', 'hjjjkkuy', 'dddeffrw', 'ddefgrst', 'ertuwwwy', 'cddefrtv', 'bfghjknv', 'bcdrtuvy', 'dffgghhj', 'addffggg', 'bdemnrtv', 'befghjrw', 'ddefghrt', 'ddffgggs', 'ddfggggg']

>>> ''.join(sorted('acowbtec'))

A linear search is fast enough for this data set but it is possible to use a dictionary and index the strings by their sorted versions.

>>> [v for v in eight if ''.join(sorted(v)) == ''.join(sorted('acowbtec'))]
['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc']

Timeit reports that this linear search takes:

>>> timeit.timeit(setup="eight = ['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc','zsdfvsvv', 'sdffbrfv', 'sdjfjsjf', 'sjnshsnj', 'adhnsrhn', 'sdfbhxdf', 'zsdfgzdf', 'cnzsdfgf', 'sdbdzvff','dbgtbzdf', 'zsvrvrdz', 'zdrvrvrn', 'nhcncnby', 'mmmnyndd', 'zswewedf', 'zeswffee', 'sefdedee', 'sefeefee','iuygfjhg', 'uytmjnbb', 'uythbgvf', 'ytrgfdcv', 'ytregfcv', 'ytrevcxd', 'ytrevcxs', 'ytrewgfd', 'trewgfds','uytrgfdd', 'uytrenhg', 'ytrebgfd', 'jhgfdbvc', 'mnbvyhtr', 'ytrehbgv', 'uytrwwsz', 'mnbtrexx', 'uytrebgv','fgfgfvdw', 'werfdcse', 'mnbvcdes', 'kjhgfnbv', 'sdfhgfdw', 'yujhredq', 'wsxrtyhn', 'jfrvsdxw', 'jmrtgedw','ujrtgedw', 'ujtgedws', 'yhvedsgy', 'yhygdfex', 'kjjkjuhy', 'rffdddwe', 'esrdtfgd', 'uytrewww', 'vfcdtred','kjhgfnbv', 'uytrbvcd', 'jhgfhgfd', 'adfgdfgg', 'mnbvtred', 'jhgfrewb', 'hgfdtred', 'dsfgdfgg', 'dfgdgggg']",stmt="[v for v in eight if ''.join(sorted(v)) == ''.join(sorted('acowbtec'))]",number=1000)

0.2 seconds for 1000 iterations.

Creating an index of {sorted:[unsorted]} and indexing that dictionary by the sorted query string can make performing multiple queries faster than performing each of their separately with linear searches.

Building that index is simply:

>>> index = {}
>>> for v in eight:
...     index.setdefault(''.join(sorted(v)), []).append(v)
>>> index
{'hjjnnsss': ['sjnshsnj'], 'bbddfgtz': ['dbgtbzdf'], 'ddffgggs': ['dsfgdfgg'], 'defghxyy': ['yhygdfex'], 'begrtuvy': ['uytrebgv'], 'dffjjjss': ['sdjfjsjf'], 'cefgrtvy': ['ytregfcv'], 'dddeffrw': ['rffdddwe'], 'befghjrw': ['jhgfrewb'], 'eeeeeffs': ['sefeefee'], 'ddfgrtuy': ['uytrgfdd'], 'cdfgrtvy': ['ytrgfdcv'], 'deefswwz': ['zswewedf'], 'cerstvxy': ['ytrevcxs'], 'bdemnrtv': ['mnbvtred'], 'bbjmntuy': ['uytmjnbb'], 'ddmmmnny': ['mmmnyndd'], 'ddfggggg': ['dfgdgggg'], 'bcchnnny': ['nhcncnby'], 'ddeeeefs': ['sefdedee'], 'bcdfghjv': ['jhgfdbvc'], 'dfffggvw': ['fgfgfvdw'], 'bemnrtxx': ['mnbtrexx'], 'bhmnrtvy': ['mnbvyhtr'], 'cdeefrsw': ['werfdcse'], 'dnrrrvvz': ['zdrvrvrn'], 'cdertvxy': ['ytrevcxd'], 'bdefgrty': ['ytrebgfd'], 'dffgghhj': ['jhgfhgfd'], 'ddffgszz': ['zsdfgzdf'], 'cdffgnsz': ['cnzsdfgf'], 'fgghijuy': ['iuygfjhg'], 'hjjjkkuy': ['kjjkjuhy'], 'bddffhsx': ['sdfbhxdf'], 'ddefgrst': ['esrdtfgd'], 'degjrtuw': ['ujrtgedw'], 'bcdemnsv': ['mnbvcdes'], 'bfghjknv': ['kjhgfnbv', 'kjhgfnbv'], 'defgrtwy': ['ytrewgfd'], 'rstuwwyz': ['uytrwwsz'], 'bdfffrsv': ['sdffbrfv'], 'ddefghrt': ['hgfdtred'], 'bfghtuvy': ['uythbgvf'], 'eeeffswz': ['zeswffee'], 'drrsvvzz': ['zsvrvrdz'], 'ddffghsw': ['sdfhgfdw'], 'abcceotw': ['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc'], 'dfjrsvwx': ['jfrvsdxw'], 'eghnrtuy': ['uytrenhg'], 'addffggg': ['adfgdfgg'], 'cddefrtv': ['vfcdtred'], 'bcdrtuvy': ['uytrbvcd'], 'degjmrtw': ['jmrtgedw'], 'bddffsvz': ['sdbdzvff'], 'adhhnnrs': ['adhnsrhn'], 'ertuwwwy': ['uytrewww'], 'degjstuw': ['ujtgedws'], 'dfssvvvz': ['zsdfvsvv'], 'hnrstwxy': ['wsxrtyhn'], 'beghrtvy': ['ytrehbgv'], 'deghsvyy': ['yhvedsgy'], 'defgrstw': ['trewgfds'], 'dehjqruy': ['yujhredq']}

Timeit states that this takes:

>>> timeit.timeit(setup="eight = ['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc','zsdfvsvv', 'sdffbrfv', 'sdjfjsjf', 'sjnshsnj', 'adhnsrhn', 'sdfbhxdf', 'zsdfgzdf', 'cnzsdfgf', 'sdbdzvff','dbgtbzdf', 'zsvrvrdz', 'zdrvrvrn', 'nhcncnby', 'mmmnyndd', 'zswewedf', 'zeswffee', 'sefdedee', 'sefeefee','iuygfjhg', 'uytmjnbb', 'uythbgvf', 'ytrgfdcv', 'ytregfcv', 'ytrevcxd', 'ytrevcxs', 'ytrewgfd', 'trewgfds','uytrgfdd', 'uytrenhg', 'ytrebgfd', 'jhgfdbvc', 'mnbvyhtr', 'ytrehbgv', 'uytrwwsz', 'mnbtrexx', 'uytrebgv','fgfgfvdw', 'werfdcse', 'mnbvcdes', 'kjhgfnbv', 'sdfhgfdw', 'yujhredq', 'wsxrtyhn', 'jfrvsdxw', 'jmrtgedw','ujrtgedw', 'ujtgedws', 'yhvedsgy', 'yhygdfex', 'kjjkjuhy', 'rffdddwe', 'esrdtfgd', 'uytrewww', 'vfcdtred','kjhgfnbv', 'uytrbvcd', 'jhgfhgfd', 'adfgdfgg', 'mnbvtred', 'jhgfrewb', 'hgfdtred', 'dsfgdfgg', 'dfgdgggg']",stmt="index={}\nfor v in eight:index.setdefault(''.join(sorted(v)), []).append(v)",number=1000)

0.2 seconds for 1000 iterations.

Then querying it is:

>>> index[''.join(sorted('acowbtec'))]
['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc']

Timeit states that this takes:

>>> timeit.timeit(setup="index = {'hjjnnsss': ['sjnshsnj'], 'bbddfgtz': ['dbgtbzdf'], 'ddffgggs': ['dsfgdfgg'], 'defghxyy': ['yhygdfex'], 'begrtuvy': ['uytrebgv'], 'dffjjjss': ['sdjfjsjf'], 'cefgrtvy': ['ytregfcv'], 'dddeffrw': ['rffdddwe'], 'befghjrw': ['jhgfrewb'], 'eeeeeffs': ['sefeefee'], 'ddfgrtuy': ['uytrgfdd'], 'cdfgrtvy': ['ytrgfdcv'], 'deefswwz': ['zswewedf'], 'cerstvxy': ['ytrevcxs'], 'bdemnrtv': ['mnbvtred'], 'bbjmntuy': ['uytmjnbb'], 'ddmmmnny': ['mmmnyndd'], 'ddfggggg': ['dfgdgggg'], 'bcchnnny': ['nhcncnby'], 'ddeeeefs': ['sefdedee'], 'bcdfghjv': ['jhgfdbvc'], 'dfffggvw': ['fgfgfvdw'], 'bemnrtxx': ['mnbtrexx'], 'bhmnrtvy': ['mnbvyhtr'], 'cdeefrsw': ['werfdcse'], 'dnrrrvvz': ['zdrvrvrn'], 'cdertvxy': ['ytrevcxd'], 'bdefgrty': ['ytrebgfd'], 'dffgghhj': ['jhgfhgfd'], 'ddffgszz': ['zsdfgzdf'], 'cdffgnsz': ['cnzsdfgf'], 'fgghijuy': ['iuygfjhg'], 'hjjjkkuy': ['kjjkjuhy'], 'bddffhsx': ['sdfbhxdf'], 'ddefgrst': ['esrdtfgd'], 'degjrtuw': ['ujrtgedw'], 'bcdemnsv': ['mnbvcdes'], 'bfghjknv': ['kjhgfnbv', 'kjhgfnbv'], 'defgrtwy': ['ytrewgfd'], 'rstuwwyz': ['uytrwwsz'], 'bdfffrsv': ['sdffbrfv'], 'ddefghrt': ['hgfdtred'], 'bfghtuvy': ['uythbgvf'], 'eeeffswz': ['zeswffee'], 'drrsvvzz': ['zsvrvrdz'], 'ddffghsw': ['sdfhgfdw'], 'abcceotw': ['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc'], 'dfjrsvwx': ['jfrvsdxw'], 'eghnrtuy': ['uytrenhg'], 'addffggg': ['adfgdfgg'], 'cddefrtv': ['vfcdtred'], 'bcdrtuvy': ['uytrbvcd'], 'degjmrtw': ['jmrtgedw'], 'bddffsvz': ['sdbdzvff'], 'adhhnnrs': ['adhnsrhn'], 'ertuwwwy': ['uytrewww'], 'degjstuw': ['ujtgedws'], 'dfssvvvz': ['zsdfvsvv'], 'hnrstwxy': ['wsxrtyhn'], 'beghrtvy': ['ytrehbgv'], 'deghsvyy': ['yhvedsgy'], 'defgrstw': ['trewgfds'], 'dehjqruy': ['yujhredq']}",stmt="index[''.join(sorted('acowbtec'))]",number=1000)

0.002 seconds for 1000 iterations.

Both of these steps are really efficient.