Vince Carter Vince Carter - 12 days ago 6
Python Question

Finding duplicates in a list of a list, and adding their values

I'm trying to find the top 50 words that occur within three texts of Shakespeare and the ratio of each words occurrance in, macbeth.txt, allswell.txt, and othello.txt. Here is my code so far:

def byFreq(pair):
return pair[1]

def shakespeare():
counts = {}
A = []
for words in ['macbeth.txt','allswell.txt','othello.txt']:
text = open(words, 'r').read()
test = text.lower()

for ch in '!"$%&()*+,-./:;<=>?@[\\]^_`{|}~':
text = text.replace(ch, ' ')
words = text.split()

for w in words:
counts[w] = counts.get(w, 0) + 1

items = list(counts.items())
items.sort()
items.sort(key=byFreq, reverse = True)

for i in range(50):
word, count = items[i]
count = count / float(len(counts))
A += [[word, count]]
print A


And its output:

>>> shakespeare()
[['the', 0.12929982922664066], ['and', 0.09148572822639668], ['I', 0.08075140278116613], ['of', 0.07684801171017322], ['to', 0.07562820200048792], ['a', 0.05220785557453037], ['you', 0.04415711149060746], ['in', 0.041717492071236886], ['And', 0.04147353012929983], ['my', 0.04147353012929983], ['is', 0.03927787265186631], ['not', 0.03781410100024396], ['that', 0.0358624054647475], ['it', 0.03366674798731398], ['Macb', 0.03342278604537692], ['with', 0.03269090021956575], ['his', 0.03147109050988046], ['be', 0.03025128080019517], ['The', 0.028787509148572824], ['haue', 0.028543547206635766], ['me', 0.027079775555013418], ['your', 0.02683581361307636], ['our', 0.025128080019516955], ['him', 0.021956574774335203], ['Enter', 0.019516955354964626], ['That', 0.019516955354964626], ['for', 0.01927299341302757], ['this', 0.01927299341302757], ['he', 0.018541107587216395], ['To', 0.01780922176140522], ['so', 0.017077335935594046], ['all', 0.0156135642839717], ['What', 0.015369602342034643], ['are', 0.015369602342034643], ['thou', 0.015369602342034643], ['will', 0.015125640400097584], ['Macbeth', 0.014881678458160527], ['thee', 0.014881678458160527], ['But', 0.014637716516223469], ['but', 0.014637716516223469], ['Macd', 0.014149792632349353], ['they', 0.014149792632349353], ['their', 0.013905830690412296], ['we', 0.013905830690412296], ['as', 0.01341790680653818], ['vs', 0.01341790680653818], ['King', 0.013173944864601122], ['on', 0.013173944864601122], ['yet', 0.012198097096852892], ['Rosse', 0.011954135154915833], ['the', 0.15813168261114238], ['I', 0.14279684862127182], ['and', 0.1231007315700619], ['to', 0.10875070343275182], ['of', 0.10481148002250985], ['a', 0.08581879572312887], ['you', 0.08581879572312887], ['my', 0.06992121553179516], ['in', 0.061902082160945414], ['is', 0.05852560495216657], ['not', 0.05486775464265616], ['it', 0.05472706809229038], ['that', 0.05472706809229038], ['his', 0.04727068092290377], ['your', 0.04389420371412493], ['me', 0.043753517163759144], ['be', 0.04305008441193022], ['And', 0.04037703995498031], ['with', 0.038266741699493526], ['him', 0.037703995498030385], ['for', 0.03601575689364097], ['he', 0.03404614518851998], ['The', 0.03137310073157006], ['this', 0.030810354530106922], ['her', 0.029262802476083285], ['will', 0.0291221159257175], ['so', 0.027011817670230726], ['have', 0.02687113111986494], ['our', 0.02687113111986494], ['but', 0.024760832864378166], ['That', 0.02293190770962296], ['PAROLLES', 0.022791221159257174], ['To', 0.021384355655599326], ['all', 0.021384355655599326], ['shall', 0.021102982554867755], ['are', 0.02096229600450197], ['as', 0.02096229600450197], ['thou', 0.02039954980303883], ['Macb', 0.019274057400112548], ['thee', 0.019274057400112548], ['no', 0.01871131119864941], ['But', 0.01842993809791784], ['Enter', 0.01814856499718627], ['BERTRAM', 0.01758581879572313], ['HELENA', 0.01730444569499156], ['we', 0.01730444569499156], ['do', 0.017163759144625774], ['thy', 0.017163759144625774], ['was', 0.01674169949352842], ['haue', 0.016460326392796848], ['I', 0.19463784682531435], ['the', 0.17894627455055595], ['and', 0.1472513769094877], ['to', 0.12989712147978802], ['of', 0.12002494024732412], ['you', 0.1079704873739998], ['a', 0.10339810869791126], ['my', 0.0909279850358516], ['in', 0.07627558973293151], ['not', 0.07159929335965914], ['is', 0.0697287748103502], ['it', 0.0676504208666736], ['that', 0.06733866777512211], ['me', 0.06099968824690845], ['your', 0.0543489556271433], ['And', 0.053205860958121166], ['be', 0.05310194326093734], ['his', 0.05154317780317988], ['with', 0.04769822300737816], ['him', 0.04665904603553985], ['her', 0.04364543281720877], ['for', 0.04322976202847345], ['he', 0.042190585056635144], ['this', 0.04187883196508366], ['will', 0.035332017042502335], ['Iago', 0.03522809934531851], ['so', 0.03356541619037722], ['The', 0.03325366309882573], ['haue', 0.031902733035435935], ['do', 0.03138314454951678], ['but', 0.030240049880494647], ['That', 0.02857736672555336], ['thou', 0.027642107450898887], ['as', 0.027434272056531227], ['To', 0.026810765873428243], ['our', 0.02504416502130313], ['are', 0.024628494232567806], ['But', 0.024420658838200146], ['all', 0.024316741141016316], ['What', 0.024212823443832486], ['shall', 0.024004988049464823], ['on', 0.02265405798607503], ['thee', 0.022134469500155875], ['Enter', 0.021822716408604385], ['thy', 0.021199210225501402], ['no', 0.020783539436766082], ['she', 0.02026395095084693], ['am', 0.02005611555647927], ['by', 0.019848280162111608], ['have', 0.019848280162111608]]


Instead of outputing the top 50 words of all three texts, its outputs the top 50 words of each text, 150 words. Im struggling on trying to delete the duplicates but add their ratios together. For example, in macbeth.txt the word 'the' has a ratio of 0.12929982922664066, allswell.txt has a ratio of 0.15813168261114238, and othello.txt has a ratio of 0.17894627455055595. I want to combine the ratios of all three of them. I;m pretty sure I have to use a for loop but I'm struggling to loop through a list within a list. I am more of a java guy so any help would be appreciated!

Answer

You're summarizing the count inside your loop over files. Move the summary code outside your for loop.