user1893110 - 6 months ago

Python printing to a file. Sorting with defaultdict

I have some code:

from collections import defaultdict
import re

filename = "training_data.txt"
with open(filename, 'r') as infile:
    # d maps word -> tag -> frequency
    d = defaultdict(lambda: defaultdict(int))
    tagRE = re.compile(r'[A-Za-z]+/[A-Z]+')
    for line in infile:
        for token in tagRE.findall(line):
            word, tag = token.split("/")
            d[word][tag] += 1

f = open('out.txt', 'w')
for word, word_data in d.items():
    f.write(word + " " + " ".join(tag + ":" + str(freq) + '\n'
                                  for tag, freq in word_data.items()))

The training data is part-of-speech tagged text, e.g.:

Today/NN ,/, PC/NN shipments/NNS annually/RB total/VBP some/DT $/$ 38.3/CD billion/CD world-wide/JJ ./.

Text written to the file should be of the format: word: part-of-speech:frequency. If a word has multiple tags, all of its tag:frequency pairs should be on the same line. At the moment, the line break inside the join puts each tag on a new line when a word has more than one. I would like to:

1) Have these on the same line e.g.
mean VBP:7 JJ:1 NN:2 VB:27

2) Have these frequencies printed in descending order. Does my data structure allow for this? I can't work out how I would do this.


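for word, word_data in d.items():
    # Sort the (tag, freq) pairs by frequency, highest first
    tagfreq = " ".join(tag + ":" + str(freq)
                       for tag, freq in
                       sorted(word_data.items(), key=lambda x: x[1], reverse=True))
    w = ''.join([word, " ", tagfreq, '\n'])
    f.write(w)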

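Generally, use join instead of + to build strings. I moved the \n to the end of each written line, so it is added once per word rather than once per tag, and sorted the items by frequency in descending order. Note that the keyword argument to sorted is reverse, not reversed. Your data structure is fine for this: the inner dict's items() can be sorted by value at write time.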
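
For reference, here is a minimal end-to-end sketch of the same approach that runs without the training file. The sample lines and the helper names count_tags and format_line are made up for illustration; the regex and the nested defaultdict are the ones from the question.

```python
from collections import defaultdict
import re

# Hypothetical sample lines standing in for training_data.txt
sample_lines = [
    "I/PRP mean/VBP it/PRP ./.",
    "a/DT mean/JJ dog/NN",
    "they/PRP mean/VBP well/RB",
]

# Same pattern as in the question: letters, a slash, then an uppercase tag
tag_re = re.compile(r'[A-Za-z]+/[A-Z]+')


def count_tags(lines):
    """Build word -> tag -> frequency from tagged lines."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        for token in tag_re.findall(line):
            word, tag = token.split("/")
            counts[word][tag] += 1
    return counts


def format_line(word, tag_counts):
    """Return one output line: the word, then tag:freq pairs, highest freq first."""
    pairs = sorted(tag_counts.items(), key=lambda x: x[1], reverse=True)
    tagfreq = " ".join("{}:{}".format(tag, freq) for tag, freq in pairs)
    return "{} {}\n".format(word, tagfreq)


counts = count_tags(sample_lines)
output = "".join(format_line(word, data) for word, data in counts.items())
print(output, end="")
```

To write to a file instead of printing, the output string can be passed to f.write inside a with open('out.txt', 'w') as f: block, which also closes the file for you. Tags with equal frequency keep their original relative order, because sorted is stable.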