user1043144 user1043144 - 1 month ago 12
Python Question

pandas dataframe from a nested dictionary (elasticsearch result)

I am having hard time translating results from elasticsearch aggregations to pandas.
I am trying to write an abstract function which would take nested dictionary (arbitrary number of levels) and flatten them into a pandas dataframe

Here is how a typical result look like

-- edit : I added the parent key as well

x1 = {u'xColor': {u'buckets': [{u'doc_count': 4,
u'key': u'red',
u'xMake': {u'buckets': [{u'doc_count': 3,
u'key': u'honda',
u'xCity': {u'buckets': [{u'doc_count': 2, u'key': u'ROME'},
{u'doc_count': 1, u'key': u'Paris'}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}},
{u'doc_count': 1,
u'key': u'bmw',
u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Paris'}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}},
{u'doc_count': 2,
u'key': u'blue',
u'xMake': {u'buckets': [{u'doc_count': 1,
u'key': u'ford',
u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Paris'}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}},
{u'doc_count': 1,
u'key': u'toyota',
u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Berlin'}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}},
{u'doc_count': 2,
u'key': u'green',
u'xMake': {u'buckets': [{u'doc_count': 1,
u'key': u'ford',
u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Berlin'}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}},
{u'doc_count': 1,
u'key': u'toyota',
u'xCity': {u'buckets': [{u'doc_count': 1, u'key': u'Berlin'}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}}],
u'doc_count_error_upper_bound': 0,
u'sum_other_doc_count': 0}}


what I would like to have is a dataframe with the doc_count of the lowest level

for the first record

red-honda-rome-2

red-honda-paris-1

red-bmw-paris-1


I came across json_normalize in pandas here but do not understand how to put the arguments and I and have seen different suggestions for flattening a nested dictionary but can't really understand how they work.
Any help to get me started would be appreciated
Elasticsearch result to table

UPDATE

I tried to use dpath which is a great library, but I do not see how to abstract this (in form of a function which takes just the buckets names as arguments) as dpath cannot handle the structure in which values are lists (and not other dictionaries)

import dpath
import pandas as pd

xListData = []
for q1 in dpath.util.get(x1, 'xColor/buckets'):
xColor = q1['key']
for q2 in dpath.util.get(q1, 'xMake/buckets'):
#print '--', q2['key']
xMake = q2['key']
for q3 in dpath.util.get(q2, 'xCity/buckets'):
#xDict = []
xCity = q3['key']
doc_count = q3['doc_count']
xDict = {'color': xColor, 'make': xMake, 'city': xCity, 'doc_count': doc_count}
#print '------', q3['key'], q3['doc_count']
xListData.append(xDict)

pd.DataFrame(xListData)


This gives:

city color doc_count make
0 ROME red 2 honda
1 Paris red 1 honda
2 Paris red 1 bmw
3 Paris blue 1 ford
4 Berlin blue 1 toyota
5 Berlin green 1 ford
6 Berlin green 1 toyota

Answer

Try with a recursive function:

import pandas as pd
def elasticToDataframe(elasticResult,aggStructure,record={},fulllist=[]):
    for agg in aggStructure:
        buckets = elasticResult[agg['key']]['buckets']
        for bucket in buckets:
            record = record.copy()
            record[agg['key']] = bucket['key']
            if 'aggs' in agg: 
                elasticToDataframe(bucket,agg['aggs'],record,fulllist)
            else: 
                for var in agg['variables']:
                    record[var['dfName']] = bucket[var['elasticName']]

                fulllist.append(record)

    df = pd.DataFrame(fulllist)
    return df

Then call the function with your data (x1) and a properly configured 'aggStructure' dict. The nested nature of the data must be reflected in this dict.

aggStructure=[{'key':'xColor','aggs':[{'key':'xMake','aggs':[{'key':'xCity','variables':[{'elasticName':'doc_count','dfName':'count'}]}]}]}]
elasticToDataframe(x1,aggStructure)

enter image description here

Cheers

Comments