Jason Collis - 1 month ago
JSON Question

Python - Only reading last line of file in specific circumstance

I'm processing some tweets with Python and trying to count the most popular words across 7 different tweets. My file is set up so that each tweet is a JSON object on its own line, and when I print out each tweet using the following, it works perfectly:

with open(fname, 'r') as f:
    for line in f:
        tweet = json.loads(line)  # load it as Python dict
        print(json.dumps(tweet, indent=4))


However, when I try to do something similar in my word count, it either reads the last line of the file 7 times or reads only the last line once. I am using the following code, which removes stopwords from the results:

with open(fname, 'r', encoding='utf8') as f:
    count_all = Counter()
    # Create a list with all the terms
    terms_stop = [term for term in tokens if term not in stop]
    for line in f:
        # Update the counter
        count_all.update(terms_stop)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))


The above produces 5 random words from the last tweet, and the count of each one is at 7 - meaning that it essentially read the last tweet 7 times instead of reading each of the 7 tweets once.

The following code is meant to see which words are most commonly grouped together. It produces 5 randomly grouped words from the last tweet, with the count at just 1, which signifies that it only read the last tweet (once) and none of the other tweets.

with open(fname, 'r', encoding='utf8') as f:
    count_all = Counter()
    # Create a list with all the terms
    terms_stop = [term for term in tokens if term not in stop]
    # Import Bigrams to group words together
    terms_bigram = bigrams(terms_stop)
    for line in f:
        # Update the counter
        count_all.update(terms_bigram)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))


The format of my json file is as follows:

{"created_at":"Tue Oct 25 11:24:54 +0000 2016","id":4444444444,.....}
{"created_at":..... }
{etc}


Help would be most appreciated! Thanks very much in advance.

UPDATE:
Don't know how I missed it, but thanks for the help everyone! I forgot to include 'line' in my for loop. Here is the working code:

with open(fname, 'r', encoding='utf8') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        # Create a list with all the terms
        terms_stop = [term for term in tokens if term not in stop]
        # Update the counter
        count_all.update(terms_stop)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))


I just had to combine the tokenizer with the word count.
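The bigram count can presumably be fixed the same way, by building the pairs inside the loop. Here is a minimal sketch of that idea. The `preprocess` function, the `stop` set, and the sample tweets below are hypothetical stand-ins (the real ones are not shown in this post), and `zip(terms_stop, terms_stop[1:])` is used in place of `bigrams(terms_stop)` so the example runs without extra libraries.

```python
import json
from collections import Counter

# Hypothetical stand-ins: the real preprocess() and stop list are not shown.
def preprocess(text):
    return text.lower().split()

stop = {"the", "a", "is"}

def count_bigrams(lines):
    """Count adjacent word pairs across all tweets, one JSON object per line."""
    count_all = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        tokens = preprocess(tweet["text"])
        terms_stop = [term for term in tokens if term not in stop]
        # Build the pairs inside the loop so every tweet contributes its own;
        # zip(...) stands in for bigrams(terms_stop) here.
        count_all.update(zip(terms_stop, terms_stop[1:]))
    return count_all

# Made-up sample data in place of the file.
sample = [
    '{"text": "Python is great"}',
    '{"text": "the Python is great"}',
]
print(count_bigrams(sample).most_common(1))
```

Passing an open file object instead of `sample` should work the same way, since the function only iterates over lines.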

Answer

Perhaps I am missing something, but you never use 'line' in the for-loop:

for line in f:
    # Update the counter
    count_all.update(terms_bigram)

so you are just looping over the lines and doing the same thing on every iteration, regardless of what each line contains.
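That would also explain the two different counts (7 versus 1). A minimal sketch, assuming `bigrams` returns a one-shot iterator (which I believe `nltk.bigrams` does, but treat that as an assumption); `fake_bigrams` is a hypothetical stand-in so the example runs without NLTK, and the tokens are made up:

```python
from collections import Counter

def fake_bigrams(tokens):
    """Hypothetical stand-in for bigrams(): yields adjacent pairs lazily."""
    return zip(tokens, tokens[1:])

# Pretend these are the tokens left over from the last tweet (made-up data).
tokens = ["python", "json", "tweets"]
lines = ["line"] * 7  # stands in for the 7 lines of the file

# Case 1: the same list is reused on every iteration,
# so each word is counted 7 times.
word_counts = Counter()
for line in lines:
    word_counts.update(tokens)

# Case 2: the iterator is exhausted by the first update(), so the
# remaining 6 updates add nothing and each pair is counted once.
pair_iter = fake_bigrams(tokens)
pair_counts = Counter()
for line in lines:
    pair_counts.update(pair_iter)

print(word_counts["python"])            # 7
print(pair_counts[("python", "json")])  # 1
```

In both cases the file is read in full; the counter just never sees anything derived from the current line.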
