Georg Heiler Georg Heiler - 2 months ago 29
Python Question

Python deduplicate records - dedupe

I want to use https://github.com/datamade/dedupe to deduplicate some records in python. Looking at their examples

data_d = {}
for row in data:
clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
row_id = int(row['id'])
data_d[row_id] = dict(clean_row)


the dictionary consumes quite a lot of memory compared to e.g. a dictionary created by pandas out of a pd.Datafrmae, or even a normal pd.Dataframe.

If this format is required, how can I convert a pd.Dataframe efficiently to such a dictionary?

edit



Example what pandas generates

{'column1': {0: 1389225600000000000,
1: 1388707200000000000,
2: 1388707200000000000,
3: 1389657600000000000,....


Example what dedupe expects

{'1': {column1: 1389225600000000000, column2: "ddd"},
'2': {column1: 1111, column2: "ddd} ...}

Answer

It appears that df.to_dict(orient='index') will produce the representation you are looking for:

import pandas

data = [[1, 2, 3], [4, 5, 6]]
columns = ['a', 'b', 'c']

df = pandas.DataFrame(data, columns=columns)

df.to_dict(orient='index')

results in

{0: {'a': 1, 'b': 2, 'c': 3}, 1: {'a': 4, 'b': 5, 'c': 6}}