Python Dictionary with a list of values for each key

I have two different text files. The first one has words and their frequencies, and looks like:


The second one is a file that has a word in the first position, followed by its associated features. It looks like:


Every word in the second file may have any number of features (ranging from 0 to 7 in my case).

For every word in file 1, I want all the features associated with it from file 2. I want to create a dictionary where the key is the word from file 1 and its corresponding value is a list of features obtained from file 2.

Also, I want unique features and want to eliminate duplicates from file 2 (I have not implemented it yet).

I have the following code, but it gives the desired output only for the first word in file 1. The dictionary does contain all the other words from file 1, but they don't have any values associated with them.

mydict = dict()

with open('sample_word_freq_sorted.txt', 'r') as f1:
    data = f1.readlines()

with open('sample_features.txt', 'r') as f2:
    for item in data:
        root = item.split()[0]
        mylist = []
        for line in f2:
            words = line.split()
            if words[0] == root:
                mylist.append(words[1:])
        mydict[root] = mylist

Also, the value for each key ends up as a list of separate sublists rather than one single list, which is not what I want. Can someone please help me find the bug in my code?


A file object is an iterator, meaning you can only iterate over it once:

>>> x = (i for i in range(3))  # example iterator
>>> for line in x:
...     print(line)
...
0
1
2
>>> for line in x:  # the second time produces no results
...     print(line)
...
>>>

So the loop for line in f2: only produces values the first time it is used (the first iteration of for item in data:). To fix this you can either do f2 = f2.readlines(), so you have a list that can be traversed more than once, or find a way to construct your dictionary with only one pass over f2.
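
For example, here is a minimal sketch of that first fix applied to your code: file 2 is read into a list once, and that list is reused for every word from file 1 (using extend, as discussed below, so each key maps to one flat list of features):

mydict = dict()

with open('sample_word_freq_sorted.txt', 'r') as f1:
    data = f1.readlines()

with open('sample_features.txt', 'r') as f2:
    features = f2.readlines()  # a list can be traversed as many times as needed

for item in data:
    root = item.split()[0]
    mylist = []
    for line in features:
        words = line.split()
        if words[0] == root:
            mylist.extend(words[1:])  # add the features themselves, not a sublist
    mydict[root] = mylist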

Then you get a list of sublists because you .append() each list of words to mylist instead of .extend()ing it with the additional words, so just changing:

mylist.append(words[1:])

to:

mylist.extend(words[1:])

should fix the other issue you are having.
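
To see the difference, a quick interactive check (not specific to your files):

>>> mylist = []
>>> mylist.append(['a', 'b'])  # appends the whole list as a single element
>>> mylist
[['a', 'b']]
>>> mylist = []
>>> mylist.extend(['a', 'b'])  # adds the elements individually
>>> mylist
['a', 'b']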

This seems like a case where collections.defaultdict would come in handy. Instead of going over the file many times, adding items for each specific word, the dict will automatically make an empty list for each new word, which lets you write your code something like this:

import collections
mydict = collections.defaultdict(list)

with open('sample_features.txt', 'r') as f2:
    for line in f2:
        tmp = line.split()
        root = tmp[0]
        words = tmp[1:]
        # in Python 3+ we can use this notation instead of the above three lines:
        # root, *words = line.split()
        mydict[root].extend(words)
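
A quick interactive check of the behaviour described above, showing that a key is created with an empty list the first time it is touched (the word and feature names here are just placeholders):

>>> import collections
>>> d = collections.defaultdict(list)
>>> d['some_word']  # accessing a missing key creates an empty list for it
[]
>>> d['some_word'].extend(['feat1', 'feat2'])
>>> d['some_word']
['feat1', 'feat2']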

Although, since you want to keep only unique features, it would make more sense to use sets instead of lists, since they by definition only contain unique elements; then instead of using .extend you would use .update:

import collections
mydict = collections.defaultdict(set)
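
Putting it together, the complete set-based version might look something like this (a sketch following the same structure as the list version above, using the Python 3 unpacking mentioned in the comments):

import collections

mydict = collections.defaultdict(set)

with open('sample_features.txt', 'r') as f2:
    for line in f2:
        root, *words = line.split()  # first token is the word, the rest are its features
        mydict[root].update(words)   # a set keeps each feature only once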