dedpo dedpo - 7 months ago 1010
Python Question

Converting JSON to a Pandas DataFrame

I am having some trouble with this.
I am trying to load this JSON into a DataFrame. I suspect the problem is how I format the JSON when I write each tweet, but I have not been able to narrow it down. Any insight would be appreciated. Attached is my raw_tweets.json, and the second code block below shows how I write it, separating the tweets with commas, i.e. ','.join(...).

Here is the link to raw_tweets.json.

Loading the file fails at raise JSONDecodeError("Extra data", s, end):

JSONDecodeError: Extra data


#JSON to DataFrame

class tweet2dframe(object):

    def __init__(self, text="", location=""):
        self.text = text
        self.location = location

    def getText(self):
        return self.text

    def getLocation(self):
        return self.location


# import json package to load json file
with open('raw_tweets.json',encoding="utf8") as jsonFile:
    polls_json = json.loads(jsonFile.read())


tweets_list = [polls(i["location"], i["text"]) for i in polls_json['text']]

colNames = ("Text", "location")
dict_list = []


for i in tweets_list:
    dict_list.append(dict(zip(colNames , [i.getText(), i.getLocation()])))


tweets_df = pd.DataFrame(dict_list)
tweets_df.head()


This is how I write my tweets to the JSON file:

saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')
saveFile.write(','.join(self.tweet_data))
saveFile.close()
exit()

Answer

raw_tweets.json contains invalid JSON. It contains JSON snippets separated by commas. To make the whole text a valid JSON array, place brackets [...] around the contents:

with open('raw_tweets.json', encoding="utf8") as jsonFile:
    polls_json = json.loads('[{}]'.format(jsonFile.read()))
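
To see why the decoder complains, here is a minimal sketch. The two snippets in raw are made up for illustration and are not taken from raw_tweets.json. json.loads parses the first object, finds more text after it, and raises "Extra data"; wrapping the same text in brackets turns it into a JSON array.

```python
import json

# Two hypothetical JSON objects joined by a comma,
# mimicking the output of ','.join(self.tweet_data)
raw = '{"text": "first"},{"text": "second"}'

try:
    json.loads(raw)
    error_message = None
except json.JSONDecodeError as err:
    # The decoder finished the first object and then found leftover text
    error_message = err.msg

# Wrapping the text in brackets makes it a valid JSON array
tweets = json.loads('[{}]'.format(raw))

print(error_message)
print(len(tweets))
```

Note that the bracket trick only works if every snippet is itself valid JSON and the separator is a single comma.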

For example,

import pandas as pd
import json

with open('raw_tweets.json', encoding="utf8") as jsonFile:
    polls_json = json.loads('[{}]'.format(jsonFile.read()))

tweets_list = [(dct['user']['location'], dct["text"]) for dct in polls_json]
colNames = ("location", "text")
tweets_df = pd.DataFrame(tweets_list, columns=colNames)
print(tweets_df.head())

yields

        location                                               text
0           None  RT @webseriestoday: Democracy Now: Noam Chomsk...
1  Pittsburgh PA  "The tuxedo was an invention of the Koch broth...
2           None  RT @webseriestoday: Democracy Now: Noam Chomsk...
3           None  RT @webseriestoday: Democracy Now: Noam Chomsk...

Another, better way to fix the problem is to write valid JSON to raw_tweets.json in the first place. After all, if you ever send the file to someone else, you will make their life easier if it contains valid JSON. We'd need to see more of your code to suggest exactly how to fix it, but in general you would use json.dump to write a list of dicts as JSON to a file instead of "manually" writing JSON snippets with saveFile.write(','.join(self.tweet_data)):

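import io
import json

tweets = []
for i in loop:  # placeholder: however you iterate over the incoming tweets
    tweets.append(tweet_dict)  # tweet_dict: one tweet as a Python dict, not a JSON string
with io.open('raw_tweets.json', 'w', encoding='utf-8') as saveFile:
    json.dump(tweets, saveFile)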

If raw_tweets.json contained valid JSON then you could load it into a Python list of dicts using:

with open('raw_tweets.json', encoding="utf8") as jsonFile:
    polls_json = json.load(jsonFile)

The rest of the code, which loads the desired parts into a DataFrame, would remain the same.
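
A write-then-read round trip can be sketched as follows. The tweet dicts and the temporary file location are assumptions made up for illustration; in real code the list would hold the tweets you collected. Note that reading from an open file uses json.load, while json.loads expects a string.

```python
import json
import os
import tempfile

# Hypothetical tweet dicts standing in for the real tweet data
tweets = [
    {"text": "first tweet", "user": {"location": "Pittsburgh PA"}},
    {"text": "second tweet", "user": {"location": None}},
]

# Write to a temporary directory rather than overwriting raw_tweets.json
path = os.path.join(tempfile.mkdtemp(), "raw_tweets.json")
with open(path, "w", encoding="utf-8") as save_file:
    json.dump(tweets, save_file)  # writes one valid JSON array

with open(path, encoding="utf-8") as json_file:
    loaded = json.load(json_file)  # json.load reads from a file object

print(loaded == tweets)
```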


How was this line of code constructed:

tweets_list = [(dct['user']['location'], dct["text"]) for dct in polls_json]

In an interactive Python session I inspected one dict in polls_json:

In [114]: import pandas as pd
In [115]: import json
In [116]: with open('raw_tweets.json', encoding="utf8") as jsonFile:
    polls_json = json.loads('[{}]'.format(jsonFile.read()))
In [117]: dct = polls_json[1]
In [118]: dct
Out[118]: 
{'contributors': None,
 'coordinates': None,
 ...
  'text': "Like the old Soviet leaders, Bernie refused to wear a tux at last night's black-tie dinner.",
  'truncated': False,
  'user': {'contributors_enabled': False,
  ...
   'location': 'Washington DC',}}

It is quite large, so I've omitted parts of it here to make the result more readable. Assuming that I correctly guessed the text and location values you are looking for, we can see that, given this dict, dct, we can access the desired text value using dct['text']. But the 'location' key is inside the nested dict, dct['user']. Therefore, we need to use dct['user']['location'] to extract the location value.
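
If some records lack a 'user' dict or a 'location' key, direct indexing with dct['user']['location'] raises KeyError. Whether that happens in raw_tweets.json is an assumption; the sample records and the extract helper below are hypothetical. A minimal sketch using dict.get to fall back to None:

```python
# Hypothetical records; the second one has no 'user' dict at all
records = [
    {"text": "a", "user": {"location": "Washington DC"}},
    {"text": "b"},
]


def extract(dct):
    """Return (location, text), using None for anything that is missing."""
    user = dct.get("user") or {}
    return (user.get("location"), dct.get("text"))


tweets_list = [extract(dct) for dct in records]
print(tweets_list)
```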

By the way, Pandas provides a convenient function for reading JSON into a DataFrame, pd.read_json, but it relies on the JSON data being "flat". Because the data we want is in nested dicts, I used custom code, the list comprehension

tweets_list = [(dct['user']['location'], dct["text"]) for dct in polls_json]

to extract the values instead of pd.read_json.
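
As an alternative to the list comprehension, pandas can flatten nested dicts itself. The sketch below assumes pandas 1.0 or later, where pd.json_normalize is available at the top level, and uses made-up records in place of polls_json. Nested keys become dotted column names such as user.location.

```python
import pandas as pd

# Hypothetical records shaped like the dicts in polls_json
records = [
    {"text": "a", "user": {"location": "Washington DC"}},
    {"text": "b", "user": {"location": None}},
]

# json_normalize flattens nested dicts into dotted column names
flat = pd.json_normalize(records)

# Keep only the two columns of interest and give location a shorter name
tweets_df = flat[["user.location", "text"]].rename(
    columns={"user.location": "location"}
)
print(tweets_df)
```

On real tweet data this produces one column per nested field, so selecting the columns you need right away keeps the DataFrame manageable.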
