Andres Azqueta Andres Azqueta - 2 months ago 8
Python Question

Parsing date in JSON format using Python

I have a set of news articles in JSON format and I am having problems parsing the date of the data. The problem is that once the articles were converted in JSON format, the date successfully got converted but also did the edition. Here is an illustration:

{"date": "December 31, 1995, Sunday, Late Edition - Final", "body": "AFTER a year of dizzying new heights for the market, investors may despair of finding any good stocks left. Navistar plans to slash costs by $112 million in 1996. Advanced Micro Devices has made a key acquisition. For the bottom-fishing investor, therefore, the big nail-biter is: Will the changes be enough to turn a company around? ", "title": "INVESTING IT;"}
{"date": "December 31, 1995, Sunday, Late Edition - Final", "body": "Few issues stir as much passion in so many communities as the simple act of moving from place to place: from home to work to the mall and home again. It was an extremely busy and productive year for us, said Frank J. Wilson, the State Commissioner of Transportation. There's a sense of urgency to get things done. ", "title": "ROAD AND RAIL;"}
{"date": "December 31, 1996, Sunday, Late Edition - Final", "body": "Widespread confidence in the state's economy prevailed last January as many businesses celebrated their most robust gains since the recession. And Steven Wynn, the chairman of Mirage Resorts, who left Atlantic City eight years ago because of local and state regulations, is returning to build a $1 billion two-casino complex. ", "title": "NEW JERSEY & CO.;"}


Since I am aiming at counting the number of articles that contain certain words I loop the articles in the following way:

import json
import re
import pandas

for i in range(1995,2017):
df = pandas.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
# Parse dates and set index
df.date = pandas.to_datetime(df.date) # is giving me a problem
df.set_index('date', inplace=True)


I am looking after orientation on how to tackle the problem in the most efficient way. I was thinking of something such "ignore anything that goes after the date of the week" when parsing the date. Is there such thing?

Thanks in advance

Answer

You can split column date by str.split, concanecate first and second column - month, day and year together (December 31 and 1995)and last call to_datetime:

for i in range(1995,2017):
    df = pandas.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
    # Parse dates and set index
    #print (df)
    a = df.date.str.split(', ', expand=True)
    df.date = a.iloc[:,0] + ' ' + a.iloc[:,1]
    df.date = pandas.to_datetime(df.date) 
    df.set_index('date', inplace=True)
    print (df)

                                                    body  \
date                                                            
1995-12-31  AFTER a year of dizzying new heights for the m...   
1995-12-31  Few issues stir as much passion in so many com...   
1996-12-31  Widespread confidence in the state's economy p...   

                        title  
date                           
1995-12-31      INVESTING IT;  
1995-12-31     ROAD AND RAIL;  
1996-12-31  NEW JERSEY & CO.;  
Comments