alvas alvas - 9 months ago 52
Python Question

How to revers index a Counter and convert it into a defaultdict? Python

Given a Counter, e.g.:

>>> from collections import Counter
>>> Counter('123112415121361273')
Counter({'1': 7, '2': 4, '3': 3, '5': 1, '4': 1, '7': 1, '6': 1})

How can I reverse the index and get the counts as keys and the values as lists of original string keys?

The purpose is to convert the example above into something like this:

defaultdict(<type 'list'>, {1: ['5', '4', '7', '6'], 3: ['3'], 4: ['2'], 7: ['1']})

I have tried tried manually reiterating through the

>>> from collections import Counter
>>> Counter('123112415121361273')
Counter({'1': 7, '2': 4, '3': 3, '5': 1, '4': 1, '7': 1, '6': 1})
>>> x = Counter('123112415121361273')
>>> from collections import Counter, defaultdict
>>> y = defaultdict(list)
>>> for s, count in x.items():
... y[count].append(s)
>>> y
defaultdict(<type 'list'>, {1: ['5', '4', '7', '6'], 3: ['3'], 4: ['2'], 7: ['1']})

But is there any other way to do this?

Since the input is the string
and the output is supposed to be the dictionary indexed by counts, is there any way to avoid the counting step when iterating through it the first time and get to the resulting defaultdict?

Answer Source

No, there is no more efficient way.

Counting is best done with a mapping, which is exactly what Counter does. Since you don't know the final count for any character until you fully traversed the string, you can't know up-front what bucket to file a character into until you have completed counting.

So the in-efficient alternative is to start with a mapping from count to characters, then move characters up to the next bucket as you find that they already have a count. Finding that they already have a count requires that you test against every bucket, so that becomes a O(NK) solution as opposed to your straightforward linear O(N) solution that the Counter gives you.

## Warning: this is not an efficient approach; use for illustration purposes only

from collections import defaultdict

s = '123112415121361273'
count_to_char = defaultdict(set)  # use a set to avoid O(N**2) performance
max_count = 0
for char in s:  # loop over N items
    for i in range(1, max_count + 1):  # loop over up to K buckets
        if char in count_to_char[i]:
            count_to_char[i + 1].add(char)
        i = 0
    max_count = max(i + 1, max_count)
# remove empty buckets again
for count in [c for c, b in count_to_char.items() if not b]:
    del count_to_char[count]

The way to avoid that scan over K buckets is to use a counter, which you already use.