nowox - 2 months ago
Python Question

Flatten a nested dict structure into a dataset

For some post-processing, I need to flatten a structure like this

{'foo': {
    'cat': {'name': 'Hodor', 'age': 7},
    'dog': {'name': 'Mordor', 'age': 5}},
 'bar': {
    'rat': {'name': 'Izidor', 'age': 3}}
}


into this dataset:

[{'foobar': 'foo', 'animal': 'dog', 'name': 'Mordor', 'age': 5},
{'foobar': 'foo', 'animal': 'cat', 'name': 'Hodor', 'age': 7},
{'foobar': 'bar', 'animal': 'rat', 'name': 'Izidor', 'age': 3}]


So I wrote this function:

import copy

def flatten(data, primary_keys):
    out = []
    # Work on a reversed copy so that pop() yields the keys in their
    # original order without touching the caller's list.
    keys = copy.copy(primary_keys)
    keys.reverse()
    def visit(node, primary_values, prim):
        if len(prim):
            p = prim.pop()
            for key, child in node.items():
                primary_values[p] = key
                visit(child, primary_values, copy.copy(prim))
        else:
            # Leaf record: copy it so the input is not modified.
            new = copy.copy(node)
            new.update(primary_values)
            out.append(new)
    visit(data, {}, keys)
    return out

out = flatten(a, ['foobar', 'animal'])


I was not really satisfied because I have to use copy.copy to protect my inputs. Obviously, when using flatten one does not want the inputs to be altered.

Then I thought of an alternative that uses more shared variables (shared within flatten, at least) and passes an index to visit instead of passing primary_keys directly. However, this does not help me get rid of the ugly initial copy:

keys = copy.copy(primary_keys)
keys.reverse()


So here is my final version:

def flatten(data, keys):
    data = copy.copy(data)  # shallow copy: the nested dicts are still shared with the caller
    keys = copy.copy(keys)
    keys.reverse()
    out = []
    values = {}
    def visit(node, id):
        if id:
            id -= 1
            for key, child in node.items():
                values[keys[id]] = key
                visit(child, id)
        else:
            node.update(values)  # updates the leaf dict in place
            out.append(node)
    visit(data, len(keys))
    return out


Is there a better implementation (that can avoid the use of copy.copy)?

Answer

Edit: modified to account for variable dictionary depth.

By using the merge function from my previous answer (below), you can avoid calling update, which modifies the caller's dictionaries in place. There is then no need to copy the dictionary first.
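To see why update is the problem even after copy.copy, here is a small sketch (the one-entry sample dict is only for illustration). copy.copy makes a shallow copy, so the nested dictionaries are still the caller's own objects:

```python
import copy

data = {'foo': {'cat': {'name': 'Hodor', 'age': 7}}}
shallow = copy.copy(data)

# The outer dict is a new object, but the nested dicts are shared.
leaf = shallow['foo']['cat']
leaf.update({'foobar': 'foo', 'animal': 'cat'})

# The update is visible through the original structure as well.
leaked = 'foobar' in data['foo']['cat']
print(leaked)
```

copy.deepcopy would avoid this, at the cost of duplicating the whole structure; building a new dictionary per record avoids the copy altogether.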

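def flatten(data, keys):
    out = []
    values = {}
    def visit(node, id):
        if id:
            id -= 1
            for key, child in node.items():
                # id counts down, so index keys from the end to keep their order
                values[keys[-1 - id]] = key
                visit(child, id)
        else:
            out.append(merge(node, values))  # use merge instead of update
    visit(data, len(keys))
    return out

The keys input needs no protection either. The only thing that modified it was keys.reverse(), which reverses the list in place. Indexing from the end with keys[-1 - id] gives the same order without reversing, so the copy of keys can go as well.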
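As a quick check that neither input is altered, here is a self-contained sketch using the question's sample data. It assumes Python 3.7+ (so dictionaries iterate in insertion order), and it uses a depth counter that counts up from 0, which is my own variation rather than the id countdown above:

```python
import copy

def merge(d1, d2):
    # Build a new dict; on duplicate keys d2 wins. Neither input is modified.
    return dict(list(d1.items()) + list(d2.items()))

def flatten(data, keys):
    out = []
    values = {}
    def visit(node, depth):
        if depth < len(keys):
            for key, child in node.items():
                values[keys[depth]] = key
                visit(child, depth + 1)
        else:
            out.append(merge(node, values))
    visit(data, 0)
    return out

a = {'foo': {'cat': {'name': 'Hodor', 'age': 7},
             'dog': {'name': 'Mordor', 'age': 5}},
     'bar': {'rat': {'name': 'Izidor', 'age': 3}}}
keys = ['foobar', 'animal']

snapshot = copy.deepcopy(a)  # only used to verify that a is unchanged
rows = flatten(a, keys)

assert a == snapshot
assert keys == ['foobar', 'animal']
```

The deepcopy here is only part of the check, not of flatten itself.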


Previous answer

How about list comprehension?

def merge(d1, d2):
    return dict(list(d1.items()) + list(d2.items()))

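# A single comprehension with two for clauses yields one flat list
# (nesting one comprehension inside another would give a list of lists).
[merge({'foobar': key, 'animal': sub_key}, sub_sub_dict)
    for key, sub_dict in a.items()
    for sub_key, sub_sub_dict in sub_dict.items()]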

The tricky part was merging the dictionaries without using update (which returns None).
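For what it's worth, on Python 3.5 or later the same merge can probably be written with dictionary unpacking. This is an alternative sketch, not the version used above, and the sample dictionaries are only illustrative:

```python
def merge(d1, d2):
    # Unpack both dicts into a new one; on duplicate keys d2 wins.
    return {**d1, **d2}

left = {'foobar': 'foo', 'animal': 'cat'}
right = {'name': 'Hodor', 'age': 7}
merged = merge(left, right)

# By contrast, update works in place and returns None, so it cannot be
# used directly as the expression inside a comprehension.
result = dict(left).update(right)
print(merged, result)
```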
