dan martin - 3 months ago
JSON Question

Python - json comparison using sets only working occasionally

I have a pretty peculiar problem on my hands. I'm not too experienced with Python (my language of choice being Swift for mobile development), but what I have to do for this project is pull some CSV files from a database, download them locally, and upload them to Amazon's DynamoDB.

I have managed to get everything working - The program downloads the csv file as a zip, extracts it using zipfile, converts the csv file to a json file, and then begins uploading the json to DynamoDB.

However, these csv files contain around 100,000 rows each, and to reupload each item every time makes no sense when only 5-10 items are changed in the csv file daily. So, what I've decided to do is before uploading the new json to DynamoDB, get the program to compare the new json to the old json, get only the new items, and upload those.

Now, on to the actual problem. What I've been attempting is this:

import json

# Raw strings keep backslashes literal (otherwise "\n" in "\newfile" is a newline).
with open(r"C:\Users\Me\Desktop\staff\oldfile.json") as json1:
    list1 = json.load(json1)
with open(r"C:\Users\Me\Desktop\staff\newfile.json") as json2:
    list2 = json.load(json2)

set_1 = set(repr(x) for x in list1)
set_2 = set(repr(x) for x in list2)

differences = (set_2 - set_1)
print(differences)


Which actually works pretty well. The result will be set() if the sets are identical, or contain only the new additional items.

However, I have noticed that when I convert the CSV file to JSON, the order of the keys within an object can differ between the two files. For example, in the first JSON file an object might be:

[{"name": "jack", "id": "3100", "photo": "http://imagesdatabase.com/is/image/jack/I_063017263_50_20141112", "category": "male employees", "commissions": "4500", "department": "Beauty > Skincare", "department_id": "709010788", "store_id": "", "additional duties": "5", "spreadsheet": "http://spreadsheetdatabase.com/previpew/01/32100/88/07/709310788.csv", "description": "Jack is a talented young man, has worked with us for over three years and, although initially starting slowly, has worked his way up to becoming the top earner of the month several times.", "join_date": "12/5/2008", "mornings": "YES", "staff_link": "http://staffdatabase.com/244234/654", "show": "NO", "retailers_id": "6017263", "head_id": "2909", "products_sold": "Skincare", "commissions_report": "http://commissionsdatabase.com/jck1/2453"}]


This same object in the new json file might be:

[{"id": "3100", "name": "jack", "photo": "http://imagesdatabase.com/is/image/jack/I_063017263_50_20141112", "category": "male employees", "commissions": "4500", "department": "Beauty > Skincare", "department_id": "709010788", "store_id": "", "additional duties": "5", "spreadsheet": "http://spreadsheetdatabase.com/previpew/01/32100/88/07/709310788.csv", "description": "Jack is a talented young man, has worked with us for over three years and, although initially starting slowly, has worked his way up to becoming the top earner of the month several times.", "join_date": "12/5/2008", "mornings": "YES", "staff_link": "http://staffdatabase.com/244234/654", "show": "NO", "retailers_id": "6017263", "head_id": "2909", "products_sold": "Skincare", "commissions_report": "http://commissionsdatabase.com/jck1/2453"}]


These are both still the same object, no?

But when I try to compare these two using Python, sometimes I get set(), and sometimes it tells me that it's a new object. What's happening?


Honestly, I've been troubleshooting this for almost a whole day now and I'm pretty much at my wit's end - I really can't understand why it would work when I run it once, and then not the next time with the same exact json objects. Any help would be greatly appreciated!

Answer

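Your code compares the repr() of each dictionary, and that string depends on the order of the dictionary's keys. Dictionary order depends on insertion and deletion history (and, in Python 3.3 through 3.5, on per-run hash randomization of string keys, which is likely why the result changes from one run to the next), so it should not be relied upon.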
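To see the effect in isolation, here is a minimal sketch using two small made-up dictionaries (not the asker's data). It assumes Python 3.7 or later, where dictionaries keep insertion order, so the difference in repr() is reproducible rather than intermittent:

```python
# Two dictionaries with the same key-value pairs, inserted in a different order.
a = {"name": "jack", "id": "3100"}
b = {"id": "3100", "name": "jack"}

# Dictionary equality ignores key order.
same_content = (a == b)

# repr() follows the key order, so on Python 3.7+ the two strings differ.
same_repr = (repr(a) == repr(b))

# Sorting the items first gives an order-independent form.
same_sorted = (tuple(sorted(a.items())) == tuple(sorted(b.items())))

print(same_content, same_repr, same_sorted)
```

The dictionaries are equal, but a set of repr() strings treats them as two different items.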

If your dictionaries are not nested, you can store them in sets as tuples of their key-value pairs, sorted:

set_1 = set(tuple(sorted(x.items())) for x in list1)
set_2 = set(tuple(sorted(x.items())) for x in list2)

This creates an immutable representation that retains the original key-value pairing but avoids any issues with ordering. These tuples can trivially be fed back into the dict() type to re-create the dictionary.
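Putting the pieces together, the comparison could look roughly like the sketch below. The helper names (freeze, new_items, freeze_nested) and the sample records are hypothetical, chosen only for illustration. The last helper is a different technique from the one above: it serializes each record with json.dumps(..., sort_keys=True), which may be an option if the dictionaries turn out to be nested.

```python
import json


def freeze(record):
    """Return an order-independent, hashable form of a flat dict."""
    return tuple(sorted(record.items()))


def new_items(old_records, new_records):
    """Return the dicts in new_records that have no equal dict in old_records."""
    old_frozen = {freeze(r) for r in old_records}
    added = {freeze(r) for r in new_records} - old_frozen
    # dict() turns each tuple of key-value pairs back into a dictionary.
    return [dict(pairs) for pairs in added]


def freeze_nested(record):
    """Alternative for nested dicts: a canonical JSON string with sorted keys."""
    return json.dumps(record, sort_keys=True)


# Hypothetical sample data: same records, different key order, one new record.
old = [{"name": "jack", "id": "3100"}, {"name": "ann", "id": "3101"}]
new = [
    {"id": "3100", "name": "jack"},
    {"id": "3101", "name": "ann"},
    {"id": "3102", "name": "sam"},
]

added = new_items(old, new)
print(added)
```

Only the record that is genuinely new is returned, regardless of key order. Note that the result comes from a set, so when several records are new their order is arbitrary.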