Question

Python: Trying to deserialize multiple JSON objects in a file, with each object spanning a consistent number of lines

OK, after nearly a week of research I'm going to give SO a shot. I have a text file that looks as follows (showing 3 separate JSON objects as an example, but the file has 50K of these):

{
"zipcode":"00544",
"current":{"canwc":null,"cig":7000,"class":"observation"},
"triggers":[178,30,176,103,179,112,21,20,48,7,50,40,57]
}
{
"zipcode":"00601",
"current":{"canwc":null,"cig":null,"class":"observation"},
"triggers":[12,23,34,28,100]
}
{
"zipcode":"00602",
"current":{"canwc":null,"cig":null,"class":"observation"},
"triggers":[13,85,43,101,38,31]
}


I know how to work with JSON objects using the Python json library, but I'm having a challenge with how to create 50 thousand different JSON objects from reading the file. (Perhaps I'm not even thinking about this correctly, but ultimately I need to deserialize them and load them into a database.) I've tried itertools, thinking that I need a generator, so I was able to use:

with open(file) as f:
    for line in itertools.islice(f, 0, 7):  # since every 7 lines is a json object
        jfile = json.load(line)


But the above obviously won't work, since it is not reading the 7 lines as a single JSON object, and I'm also not sure how to then iterate over the entire file and load the individual JSON objects.

The following would give me a list I can slice:

list(open(file))[:7]


Any help would be really appreciated.




Update: Extremely close to what I need, and I think literally one step away, but still struggling a little with the iteration. The following will finally get me an iterative printout of all of the dataframes, but how do I make it so that I can capture one giant dataframe with all of the pieces essentially concatenated? I could then export that final dataframe to CSV, etc. (Also, is there a better way to upload this result into a database rather than creating a giant dataframe first?)

import itertools
import json
from itertools import chain

import pandas as pd

def lines_per_n(f, n):
    # yield the next n lines of f joined into a single string
    for line in f:
        yield ''.join(chain([line], itertools.islice(f, n - 1)))

def flatten(jfile):
    # collapse nested dicts and lists into one flat dict
    for k, v in list(jfile.items()):
        if isinstance(v, list):
            jfile[k] = ','.join(map(str, v))   # triggers are ints, so stringify first
        elif isinstance(v, dict):
            for kk, vv in v.items():
                jfile['%s' % (kk)] = vv
            del jfile[k]
    return jfile

with open('deadzips.json') as f:
    for chunk in lines_per_n(f, 7):
        try:
            jfile = json.loads(chunk)
            pd.DataFrame(flatten(jfile).items())
        except ValueError:
            pass
        else:
            pass

Answer

Load 6 extra lines instead, and pass the string to json.loads():

with open(file) as f:
    for line in f:
        # slice the next 6 lines from the iterable, as a list.
        lines = [line] + list(itertools.islice(f, 6))
        jfile = json.loads(''.join(lines))

        # do something with jfile

json.load() won't work here: it expects a file object rather than a string, and it would slurp up more than just the next object in the file. islice(f, 0, 7) reads only the first 7 lines once, rather than stepping through the file in 7-line blocks.
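To see the difference concretely, a minimal check (assuming the file is named deadzips.json as in the question and each object really does occupy 7 lines):

import json
from itertools import islice

with open('deadzips.json') as f:
    try:
        json.load(f)                     # tries to parse the whole file as one document
    except ValueError as e:
        print(e)                         # "Extra data: ..." because there are many objects

with open('deadzips.json') as f:
    block = ''.join(islice(f, 7))        # exactly one object's worth of lines
    print(json.loads(block)['zipcode'])  # parses cleanly on its own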

You can wrap reading a file in blocks of size N in a generator:

from itertools import islice, chain

def lines_per_n(f, n):
    for line in f:
        yield ''.join(chain([line], islice(f, n - 1)))

then use that to chunk up your input file:

with open(file) as f:
    for chunk in lines_per_n(f, 7):
        jfile = json.loads(chunk)

        # do something with jfile
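To get the single combined DataFrame asked about in the question's follow-up, one option is to collect the flattened dicts in a list and build the frame once at the end. A sketch, assuming the lines_per_n() generator above and the flatten() helper from the question are in scope:

import json
import pandas as pd

records = []
with open('deadzips.json') as f:
    for chunk in lines_per_n(f, 7):
        try:
            records.append(flatten(json.loads(chunk)))  # one flat dict per object
        except ValueError:
            continue                                    # skip blocks that fail to parse

big_df = pd.DataFrame(records)              # one DataFrame covering all 50K objects
big_df.to_csv('deadzips.csv', index=False)

Building the frame once from a list of dicts is considerably cheaper than concatenating 50,000 one-row frames.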

Alternatively, if your blocks turn out to be of variable length, read until you have something that parses:

with open(file) as f:
    for line in f:
        while True:
            try:
                jfile = json.loads(line)
                break
            except ValueError:
                # Not yet a complete JSON value
                line += next(f)

        # do something with jfile
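As for the side question about the database: you don't need a giant DataFrame at all. A sketch using the standard-library sqlite3 module (the weather.db file name and the zips table layout are made up for illustration), again reusing lines_per_n():

import json
import sqlite3

conn = sqlite3.connect('weather.db')        # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS zips (zipcode TEXT, cig INTEGER, triggers TEXT)')

rows = []
with open('deadzips.json') as f:
    for chunk in lines_per_n(f, 7):
        try:
            obj = json.loads(chunk)
        except ValueError:
            continue
        rows.append((obj['zipcode'],
                     obj['current']['cig'],
                     ','.join(map(str, obj['triggers']))))
        if len(rows) >= 1000:               # insert in batches instead of one row at a time
            conn.executemany('INSERT INTO zips VALUES (?, ?, ?)', rows)
            rows = []

if rows:
    conn.executemany('INSERT INTO zips VALUES (?, ?, ?)', rows)
conn.commit()
conn.close()

pandas.DataFrame.to_sql() against the same connection is another option if you'd rather stay in pandas.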