nrksj nrksj - 1 year ago 62
Python Question

Ordering a nested dictionary by the frequency of the nested value

I have this list made from a csv which is massive.
For every item in list, I have broken it into its id and details. id is always at most 3 characters long and details is variable.
I created an empty dictionary, D...(rest of code below):


for v in list:

    id = v[0:3]
    details = v[3:]

    if id not in D:
        D[id] = {}

    if details not in D[id]:
        D[id][details] = 0

    D[id][details] += 1

Aside: Can you help me understand what the two if statements are doing? Very new to python and programming.

Anyway, it produces something like this:

{'KEY1_1': {'key2_1' : value2_1, 'key2_2' : value2_2, 'key2_3' : value2_3},
'KEY1_2': {'key2_1' : value2_1, 'key2_2' : value2_2, 'key2_3' : value2_3},
and many more KEY1's with variable numbers of key2's

Each 'KEY1' is unique but each 'key2' isn't necessarily. The value2_x are all different.

Ok so, right now I found a way to sort by the first KEY:

for k, v in sorted(D.items()):
    print('{} : {}'.format(k, v))

I have done enough research to know that dictionaries can't really be sorted, but I don't care about sorting; I care about ordering, or more specifically frequencies of occurrence. In my code value2_x is the number of times its corresponding key2_x occurs for that particular KEY1_x. I am starting to think I should have used better variable names.

Question: How do I order the top-level/overall dictionary by the number in value2_x, which is in the nested dictionary? I want to do some statistics on those numbers, like...

  1. How many times does the most frequent KEY1_x:key2_x pair show up?

  2. What are the 10, 20, 30 most frequent KEY1_x:key2_x pairs?

Can I only do that by each KEY1_x or can I do it overall? Bonus: If I could order it that way for presentation/sharing that would be very helpful because it is such a large data set. Many thanks in advance and I hope I've made my question and intent clear.


You could use Counter to order the key pairs based on their frequency. It also provides an easy way to get the x most frequent items:

from collections import Counter

d = {
    'KEY1': {
        'key2_1': 5,
        'key2_2': 1,
        'key2_3': 3
    },
    'KEY2': {
        'key2_1': 2,
        'key2_2': 3,
        'key2_3': 4
    }
}

# Flatten the nested dict into one Counter keyed by (KEY1, key2) pairs
c = Counter()
for k, v in d.items():
    c.update({(k, k1): v1 for k1, v1 in v.items()})

print(c.most_common(3))


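Output:

[(('KEY1', 'key2_1'), 5), (('KEY2', 'key2_3'), 4), (('KEY2', 'key2_2'), 3)]

Note that ('KEY1', 'key2_3') also has a count of 3, so which of the two tied pairs is listed third may differ between Python versions.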
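
As a rough sketch of how the two statistics from the question could be read off such a Counter: the data below is made up, and the names pair_counts, top_pair, top_count and top_n are only illustrative, not part of the original code.

```python
from collections import Counter

# Hypothetical sample data in the same shape as the nested dictionary D.
D = {
    'abc': {'xx': 5, 'yy': 1},
    'def': {'xx': 2, 'zz': 7, 'yy': 3},
}

# Flatten to one Counter keyed by (KEY1, key2) pairs.
pair_counts = Counter()
for key1, inner in D.items():
    for key2, count in inner.items():
        pair_counts[(key1, key2)] += count

# 1. How often does the most frequent pair occur?
top_pair, top_count = pair_counts.most_common(1)[0]

# 2. The N most frequent pairs for several N (shorter if fewer pairs exist).
top_n = {n: pair_counts.most_common(n) for n in (10, 20, 30)}

# Print the ranking in a shareable form, most frequent first.
for (key1, key2), count in pair_counts.most_common(3):
    print('{}:{} -> {}'.format(key1, key2, count))
```

most_common() with no argument returns every pair in descending order of count, which may be enough for the presentation/sharing case.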

If you only care about the most common key pairs and have no other reason to build nested dictionary you could just use the following code:

from collections import Counter

l = ['foobar', 'foofoo', 'foobar', 'barfoo']
D = Counter((v[:3], v[3:]) for v in l)
print(D.most_common())  # [(('foo', 'bar'), 2), (('foo', 'foo'), 1), (('bar', 'foo'), 1)]
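
On the question of ranking per KEY1 versus overall, both seem possible from the same counts. The following is a minimal sketch with made-up rows; rows, overall, per_id and ranked_per_id are hypothetical names chosen for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical rows; the first three characters are the id, the rest the details.
rows = ['foobar', 'foofoo', 'foobar', 'barfoo', 'foobaz', 'foobar']

# Overall ranking: one Counter keyed by (id, details) pairs.
overall = Counter((row[:3], row[3:]) for row in rows)

# Per-id ranking: regroup the same counts into one Counter per id.
per_id = defaultdict(Counter)
for (id_, details), count in overall.items():
    per_id[id_][details] = count

# For each id, the details ordered from most to least frequent.
ranked_per_id = {id_: counts.most_common() for id_, counts in per_id.items()}

print(overall.most_common(1))
print(ranked_per_id['foo'][0])
```

Keeping the flat Counter as the primary structure and deriving the per-id view from it avoids counting the data twice.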

The if statements you asked about are checking whether the key exists in the dict:

>>> d = {1: 'foo'}
>>> 1 in d
True
>>> 2 in d
False
So the following code checks whether the key id exists in the dict D and, if it doesn't, assigns an empty dict to it:

if id not in D:
    D[id] = {}

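The second if does exactly the same for the nested dictionary: if details is not yet a key in D[id], it is set to 0 so that the += 1 that follows has a value to increment.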
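
As an aside, the two membership checks could also be left to collections.defaultdict, which creates a default value the first time a missing key is accessed. This is an alternative sketch rather than the code from the question; rows and plain are hypothetical names and the data is made up.

```python
from collections import defaultdict

# Hypothetical rows standing in for the items read from the CSV.
rows = ['foobar', 'foofoo', 'foobar', 'barfoo']

# A missing outer key gets a new inner defaultdict; a missing inner key starts at 0.
D = defaultdict(lambda: defaultdict(int))

for row in rows:
    key, details = row[:3], row[3:]
    D[key][details] += 1  # no explicit "if ... not in ..." checks needed

# Convert to plain dicts for printing or comparison.
plain = {key: dict(inner) for key, inner in D.items()}
print(plain)
```

The result has the same nested shape as the dictionary built with the two if statements.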