ganesa75 ganesa75 - 6 months ago 27
JSON Question

Sort JSON dictionaries using datetime format not consistent

I have JSON file (post responses from an API) - I need to sort the dictionaries by a certain key in order to parse the JSON file in chronological order. After studying the data, I can sort it by the date format in metadata or by the number sequences of the S5CV[0156]P0.xml

One text example that you can load in JSON here -

I have written 2 codes to sort the list of objects by a certain key. The 1st one sorts by the 'text' of the xml. The 2nd one by [metadata][0][value].

The 1st one works, but a few of the XMLs, even if they are higher in number, actually have documents inside older than I expected.

For the 2nd code the format of date is not consistent and sometimes the value is not present at all. I am struggling to extract the datetime format in a consistent way. The second one also gives me an error, but I cannot figure out why - string indices must be integers.

# 1st code (it works but not ideal)

# load post response r1 in json (python 3.5)

# iterate through dictionaries and sort by the 4 num of xml (ex. 0156)

list = []
for row in j["tree"]["children"][0]["children"]:

newlist = sorted(list, key=lambda k: k['text'][-9:])

# 2nd code. I need something to make consistent datetime,
# except missing values and solve the list index error

list = []
for row in j["tree"]["children"][0]["children"]:

# extract the last 3 blocks of characters from the [metadata][0][value]
# usually are like this "7th april, 1922." and trasform in datatime format
# using dparser.parse

def date(key):
return dparser.parse((' '.join(key.split(' ')[-3:])),fuzzy=True)

def order(slist):
return sorted(slist, key=lambda k: k[date(["metadata"][0]["value"])])

except ValueError:
return 0



Clearly you can't use the .xml filenames to sort the data if it's unreliable, so the most promising strategy seems to be what you're attempting to do in the 2nd code.

When I mentioned needing a datetime to sort the items in my comments to your other question, I literally meant something like instances, not strings like "28th july, 1933", which wouldn't provide the proper ordering needed since they would be compared lexicographically with one another, not numerically like datetime.dates.

Here's something that seems to work. It uses the re module to search for the date pattern in the strings that usually contain them (those with a "name" associated with the value "Comprising period from"). If there's more than one date match in the string, it uses the last one. This is then converted into a date instance and returned as the value to key on.

Since some of the items don't have valid date strings, a default one is substituted for sorting purposes. In the code below, a earliest valid date is used as the default—which makes all items with date problems appear at the beginning of the sorted list. Any items following them should be in the proper order.

Not sure what you should do about items lacking date information—if it isn't there, your only options are to guess a value, ignore them, or consider it an error.

# v3.2
import datetime
import json
import re

# default date when one isn't found
DEFAULT_DATE =, 1, datetime.MINYEAR)  # 01/01/0001
MONTHS = ('january february march april may june july august september october'
          ' november december'.split())
# dictionary to map month names to numeric values 1-12
MONTH_TO_ORDINAL = dict( zip(MONTHS, range(1, 13)) )
DMY_DATE_REGEX = ('(3[01]|[12][0-9]|[1-9])\\s*(?:st|nd|rd|th)?\\s*'
                + '(' + '|'.join(MONTHS) + ')(?:[,.])*\\s*'
                + '([0-9]{4})')
MDY_DATE_REGEX = ('(' + '|'.join(MONTHS) + ')\\s+'
                + '(3[01]|[12][0-9]|[1-9])\\s*(?:st|nd|rd|th)?,\\s*'
                + '([0-9]{4})')

def extract_date(item):
    metadata0 = item["metadata"][0]  # check only first item in metadata list
    if metadata0.get("name") != "Comprising period from":
        return DEFAULT_DATE
        value = metadata0.get("value", "")
        matches = DMY_DATE.findall(value)  # try dmy pattern (most common)
        if matches:
            day, month, year = matches[-1]  # use last match if more than one
            matches = MDY_DATE.findall(value)  # try mdy pattern...
            if matches:
                month, day, year = matches[-1]  # use last match if more than one
                print('warning: date patterns not found in "{}"'.format(value))
                return DEFAULT_DATE

        # convert strings found into numerical values
        year, month, day = int(year), MONTH_TO_ORDINAL[month.lower()], int(day)
        return, month, day)

# test files: 'json_sample.txt', 'india_congress.txt', 'olympic_games.txt'
with open('json_sample.txt', 'r') as f:
    j = json.load(f)

orig_list = j["tree"]["children"][0]["children"]
sorted_list = sorted(orig_list, key=extract_date)
for item in sorted_list:
    print(json.dumps(item, indent=4))

P.S. I recommend you don't use variable names, list list, which conflict with the those of Python's own built-in classes and functions. It can cause serious headaches when developing and debugging your code.