
Python Pandas -- merging mostly duplicated rows

Some of my data looks like:

date, name, value1, value2, value3, value4
1/1/2001,ABC,1,1,,
1/1/2001,ABC,,,2,
1/1/2001,ABC,,,,35


I am trying to get to the point where I can run

data.set_index(['date', 'name'])


But with the data as-is there are of course duplicates (as shown above), so I cannot do this: I don't want an index with duplicate entries, and I can't simply call drop_duplicates(), since that would lose data.

I would like to be able to collapse rows which have the same [date, name] values into a single row, if they can be successfully converged based on certain values being NaN (similar to the behavior of combine_first()). E.g., the above would end up as

date, name, value1, value2, value3, value4
1/1/2001,ABC,1,1,2,35


If two rows have different non-NaN values in the same column, they should not be converged (this would probably be an error that I would need to follow up on).

(To extend the above example: there may in fact be an arbitrary number of rows, across an arbitrary number of columns, that should be converged into one single row.)

This feels like a problem that should be very solvable via pandas, but I am having trouble figuring out an elegant solution.

Answer

Let's imagine you have some function combine_it that, given a set of rows that would have duplicate values, returns a single row. First, group by date and name:

grouped = data.groupby(['date', 'name'])

Then just apply the aggregation function and boom you're done:

result = grouped.agg(combine_it)
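
As a minimal sketch (the implementation of combine_it below is just one possibility, not part of the original answer), combine_it can keep the single non-NaN value per column in each group and raise if it finds conflicting values, which matches the "don't merge, flag it as an error" requirement:

import numpy as np
import pandas as pd

def combine_it(col):
    # col is one column of one (date, name) group
    non_null = col.dropna()
    if non_null.nunique() > 1:
        # conflicting non-NaN values: don't silently merge
        raise ValueError("Conflicting values: %s" % non_null.tolist())
    return non_null.iloc[0] if len(non_null) else np.nan

# the sample data from the question
data = pd.DataFrame({
    'date':   ['1/1/2001'] * 3,
    'name':   ['ABC'] * 3,
    'value1': [1, np.nan, np.nan],
    'value2': [1, np.nan, np.nan],
    'value3': [np.nan, 2, np.nan],
    'value4': [np.nan, np.nan, 35],
})

grouped = data.groupby(['date', 'name'])
result = grouped.agg(combine_it)
# result has one row per (date, name):
#                value1  value2  value3  value4
# date     name
# 1/1/2001 ABC      1.0     1.0     2.0    35.0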

You can also provide different aggregation functions for different columns by passing agg a dict.
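
For instance (a hypothetical mix of the custom function above with built-in aggregations, purely illustrative):

result = grouped.agg({'value1': combine_it,
                      'value2': combine_it,
                      'value3': 'max',
                      'value4': 'sum'})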
