nrksj nrksj - 1 year ago 74
Python Question

Ordering a nested dictionary by the frequency of the nested value

I have this list made from a CSV, which is massive. For every item in list, I have broken it into its id and details. id is always between 0 and 3 characters long and details is variable in length.
I created an empty dictionary, D (rest of the code below):

D = {}

for v in list:
    id = v[0:3]
    details = v[3:]

    if id not in D:
        D[id] = {}

    if details not in D[id]:
        D[id][details] = 0

    D[id][details] += 1


Aside: Can you help me understand what the two if statements are doing? I am very new to Python and programming.

Anyway, it produces something like this:

{'KEY1_1': {'key2_1' : value2_1, 'key2_2' : value2_2, 'key2_3' : value2_3},
'KEY1_2': {'key2_1' : value2_1, 'key2_2' : value2_2, 'key2_3' : value2_3},
and many more KEY1's with variable numbers of key2's


Each 'KEY1' is unique but each 'key2' isn't necessarily. The value2_ values are all different.

OK, so right now I have found a way to sort by the first KEY:

for k, v in sorted(D.items()):
    print('%s : %s' % (k, v))


I have done enough research to know that dictionaries can't really be sorted, but I don't care about sorting; I care about ordering or, more specifically, frequencies of occurrence. In my code, value2_x is the number of times its corresponding key2_x occurs for that particular KEY1_x. I am starting to think I should have used better variable names.

Question: How do I order the top-level/overall dictionary by the number in value2_x, which is in the nested dictionary? I want to do some statistics on those numbers, like...


  1. How many times does the most frequent KEY1_x:key2_x pair show up?

  2. What are the 10, 20, 30 most frequent KEY1_x:key2_x pairs?



Can I only do that by each KEY1 or can I do it overall? Bonus: If I could order it that way for presentation/sharing, that would be very helpful because it is such a large data set. Many thanks in advance, and I hope I've made my question and intent clear.

Answer

You could use Counter (from the collections module) to order the key pairs by their frequency. It also provides an easy way to get the x most frequent items:

from collections import Counter

d = {
    'KEY1': {
        'key2_1': 5,
        'key2_2': 1,
        'key2_3': 3
    },
    'KEY2': {
        'key2_1': 2,
        'key2_2': 3,
        'key2_3': 4
    }
}

c = Counter()
for k, v in d.items():
    # Flatten each nested entry into an (outer key, inner key) pair mapped to its count
    c.update({(k, k1): v1 for k1, v1 in v.items()})

print(c.most_common(3))

Output (two pairs tie at a count of 3, so which of them appears third can vary with the Python version):

[(('KEY1', 'key2_1'), 5), (('KEY2', 'key2_3'), 4), (('KEY2', 'key2_2'), 3)]
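
The ranking above is overall, across every KEY1. If a ranking within each KEY1 is wanted instead, one option is to wrap each inner dictionary in its own Counter. This is only a minimal sketch, assuming the same d as above; the helper name top_per_key is made up for illustration.

```python
from collections import Counter

d = {
    'KEY1': {'key2_1': 5, 'key2_2': 1, 'key2_3': 3},
    'KEY2': {'key2_1': 2, 'key2_2': 3, 'key2_3': 4},
}

def top_per_key(nested, n):
    """Return {outer_key: [(inner_key, count), ...]} keeping the n highest counts per outer key."""
    return {k: Counter(inner).most_common(n) for k, inner in nested.items()}

# Print the two most frequent inner keys for every outer key, in outer-key order.
for k, ranked in sorted(top_per_key(d, 2).items()):
    print('%s : %s' % (k, ranked))
```

Each inner dictionary is ranked independently here, so counts are only compared within the same KEY1.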

If you only care about the most common key pairs and have no other reason to build a nested dictionary, you could just use the following code:

from collections import Counter

l = ['foobar', 'foofoo', 'foobar', 'barfoo']
D = Counter((v[:3], v[3:]) for v in l)
print(D.most_common())  # [(('foo', 'bar'), 2), (('foo', 'foo'), 1), (('bar', 'foo'), 1)]
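
With a flat Counter like this, the two statistics from the question could be read off roughly as follows. This is a sketch under assumptions: rows is hypothetical sample data standing in for the real CSV items, and the names pairs, ranked and top_by_n are illustrative only.

```python
from collections import Counter

# Hypothetical sample rows; in practice these would come from the CSV.
rows = ['foobar', 'foofoo', 'foobar', 'barfoo', 'foobar', 'barfoo']
pairs = Counter((v[:3], v[3:]) for v in rows)

# 1. The most frequent pair and how many times it shows up.
top_pair, top_count = pairs.most_common(1)[0]

# 2. The 10, 20 and 30 most frequent pairs, sliced from one full ranking.
ranked = pairs.most_common()  # every pair, highest count first
top_by_n = {n: ranked[:n] for n in (10, 20, 30)}

# Tab-separated lines are one simple way to present or share the ranking.
for (key1, key2), count in ranked[:10]:
    print('%s:%s\t%d' % (key1, key2, count))
```

Slicing simply returns fewer entries when there are fewer than n distinct pairs, so the same code works on small samples.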

The if statements you asked about check whether a key exists in a dictionary:

>>> d = {1: 'foo'}
>>> 1 in d
True
>>> 2 in d
False

So the following code checks whether the key stored in id exists in the dictionary D and, if it doesn't, assigns an empty dictionary to that key:

if id not in D:
    D[id] = {}

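The second if does exactly the same for the nested dictionary D[id]: if details is not yet a key there, its count is initialized to 0 so that the following += 1 has a value to increment.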
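
As an illustration of what the two if statements accomplish, the same nested counts can be built without them by using collections.defaultdict, which creates a default value the first time a missing key is accessed. This is a sketch rather than the code from the question; rows is hypothetical sample data, and key replaces id only to avoid shadowing the built-in id().

```python
from collections import defaultdict

rows = ['foobar', 'foofoo', 'foobar', 'barfoo']  # hypothetical sample data

# A missing outer key gets a fresh inner defaultdict;
# a missing inner key starts at int(), which is 0.
D = defaultdict(lambda: defaultdict(int))

for v in rows:
    key = v[:3]      # called id in the question
    details = v[3:]
    D[key][details] += 1  # no existence checks needed

# Convert to plain dicts for printing.
print({k: dict(inner) for k, inner in D.items()})
```

The trade-off is that merely reading a missing key from a defaultdict also creates it, so membership tests with in are the safer way to check for a key afterwards.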