ganesa75 ganesa75 - 3 years ago 132
JSON Question

Recursive walk through a JSON file extracting SELECTED strings

I need to recursively walk through JSON files (POST responses from an API), extracting the strings that have ["text"] as a key:

{"text":"this is a string"}

I need to start parsing from the source that has the oldest date in its metadata, extract the strings from that source, then move to the 2nd oldest source, and so on. The JSON file can be deeply nested, and the level at which the strings sit can change from time to time.

There are many keys called ["text"] and I don't need all of them; I need ONLY the ones whose values are strings. Better still, the "text":"string" pairs I need are ALWAYS in the same object {} as a "type":"sentence" pair. See image.

What I am asking

Modify the 2nd code below in order to recursively walk the file and extract ONLY the ["text"] values when they are in the same object {} together with "type":"sentence".

Below is a snippet of the JSON file (in green, the text and metadata I need; in red, the ones I don't need to extract):

screenshot of contents of JSON file

Link to full JSON sample:

What I have done so far:

1) The easy way: transform the JSON file into a string and search for content between double quotes ("") because in all JSON POST responses the "strings" I need are the only ones that come between double quotes. However, this option prevents me from ordering the resources beforehand, so it is not good enough.

r1 =, data=payload1)
j = str(r1.json())

sentences_list = (re.findall(r'\"(.+?)\"', j))

numentries = 0
for sentences in sentences_list:
    numentries += 1

2) Smarter way: recursively walk through the JSON file and extract the ["text"] values

def get_all(myjson, key):
    if type(myjson) is dict:
        for jsonkey in myjson:
            if type(myjson[jsonkey]) in (list, dict):
                get_all(myjson[jsonkey], key)
            elif jsonkey == key:
                print(myjson[jsonkey])
    elif type(myjson) is list:
        for item in myjson:
            if type(item) in (list, dict):
                get_all(item, key)

get_all(r1.json(), "text")  # prints each match as it goes; the function itself returns None

It extracts all the values that have ["text"] as a key. Unfortunately, the file contains other items (that I don't need) that also have ["text"] as a key, so it returns text that I don't need.

Please advise.


I have written 2 pieces of code to sort the list of objects by a certain key. The 1st one sorts by the 'text' of the XML. The 2nd one sorts by the 'Comprising period from' value.

The 1st one works, but a few of the XMLs, even though they are higher in number, actually contain documents older than I expected.

For the 2nd code, the format of 'Comprising period from' is not consistent, and sometimes the value is not present at all. The 2nd one also gives me an error, but I cannot figure out why:
string indices must be integers

# 1st code (it works but not ideal)


list = []
for row in j["tree"]["children"][0]["children"]:

newlist = sorted(list, key=lambda k: k['text'][-9:])

# 2nd code: I need something that handles missing values and solves the
# index error
list = []
for row in j["tree"]["children"][0]["children"]:

def date(key):
    return dparser.parse((' '.join(key.split(' ')[-3:])),fuzzy=True)

def order(list_to_order):
    return sorted(list_to_order,
                  key=lambda k: k[date(["metadata"][0]["value"])])
except ValueError:
return 0
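
A note on the error in the 2nd code: inside the lambda, ["metadata"][0] is a list literal indexed at 0, so it evaluates to the string "metadata", and indexing that string with "value" is what raises "string indices must be integers". The intent was probably k["metadata"][0]["value"]. Below is a minimal sketch of a sort that tolerates missing or unparseable dates. It assumes each item may carry a "metadata" list whose first entry holds the date string under "value" (a layout inferred from the code above, not confirmed by the sample). The names DATE_FORMATS, parse_date and item_date are made up for illustration, and the date formats are guesses. It uses only the standard library's datetime.strptime rather than the fuzzy parser in the original, so the format list would need extending to match the real data.

```python
from datetime import datetime

# Assumed date layouts; extend to match the real "Comprising period from" values.
DATE_FORMATS = ("%d %B %Y", "%d %b %Y", "%Y-%m-%d")


def parse_date(text):
    """Return a datetime parsed from text (or its last three words), else None."""
    if not isinstance(text, str):
        return None
    tail = " ".join(text.split(" ")[-3:])
    for candidate in (tail, text.strip()):
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(candidate, fmt)
            except ValueError:
                continue
    return None


def item_date(item):
    """Look up item["metadata"][0]["value"], tolerating missing pieces."""
    try:
        return parse_date(item["metadata"][0]["value"])
    except (KeyError, IndexError, TypeError):
        return None


def order(items):
    """Sort oldest first; items without a usable date go last."""
    def sort_key(item):
        parsed = item_date(item)
        return (parsed is None, parsed or datetime.max)
    return sorted(items, key=sort_key)


sample = [
    {"text": "b", "metadata": [{"value": "Comprising period from 3 March 2015"}]},
    {"text": "c"},  # no metadata at all
    {"text": "a", "metadata": [{"value": "1 January 2014"}]},
]
print([item["text"] for item in order(sample)])
```

Sorting on a (missing, date) tuple keeps undated items out of the comparison with real dates, which avoids mixing 0 and datetime values in one sort key.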


Answer Source

I think this will do what you want, as far as selecting the right strings. I also changed the way type-checking was done to use isinstance(), which is considered a better way to do it because it supports object-oriented polymorphism.

import json
_NUL = object()  # unique value guaranteed to never be in JSON data

def get_all(myjson, kind, key):
    """ Recursively find all the values of key in all the dictionaries in myjson
        with a "type" key equal to kind.
    """
    if isinstance(myjson, dict):
        key_value = myjson.get(key, _NUL)  # _NUL if key not present
        if key_value is not _NUL and myjson.get("type") == kind:
            yield key_value
        for jsonkey in myjson:
            jsonvalue = myjson[jsonkey]
            for v in get_all(jsonvalue, kind, key):  # recursive
                yield v
    elif isinstance(myjson, list):
        for item in myjson:
            for v in get_all(item, kind, key):  # recursive
                yield v    

with open('json_sample.txt', 'r') as f:
    data = json.load(f)

numentries = 0
for text in get_all(data, "sentence", "text"):
    numentries += 1

print('\nNumber of "text" entries found: {}'.format(numentries))
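
For a quick check without the sample file, the same idea can be tried on inline data. This is only a sketch under assumptions: the function name iter_sentence_texts and the sample structure are invented for illustration, and it adds an isinstance(value, str) test because the question asks only for "text" values that are strings. It also uses yield from, which needs Python 3.3 or later.

```python
import json


def iter_sentence_texts(node, kind="sentence", key="text"):
    """Yield string values of key from dicts whose "type" equals kind."""
    if isinstance(node, dict):
        value = node.get(key)
        if isinstance(value, str) and node.get("type") == kind:
            yield value
        # Keep descending: matching objects may sit at any depth.
        for child in node.values():
            yield from iter_sentence_texts(child, kind, key)
    elif isinstance(node, list):
        for child in node:
            yield from iter_sentence_texts(child, kind, key)


# Hypothetical sample; the real API response will look different.
raw = """
{"tree": {"text": "root label", "children": [
    {"type": "sentence", "text": "this is a string"},
    {"type": "heading", "text": "skip me"},
    {"type": "sentence", "text": {"nested": "not a string"}},
    {"data": [{"type": "sentence", "text": "deeper string"}]}
]}}
"""
data = json.loads(raw)
texts = list(iter_sentence_texts(data))
print(texts)
```

Only the two string values that share an object with "type": "sentence" should come back; the root label, the heading, and the non-string "text" value are skipped.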