severian - 3 years ago
Python Question

Python Pandas -- merging mostly duplicated rows

Some of my data looks like:

date, name, value1, value2, value3, value4

I am trying to get to the point where I can run

data.set_index(['date', 'name'])

But, with the data as-is, there are duplicate (date, name) pairs, so I cannot do this (and I don't want an index with duplicates; I also can't simply drop_duplicates(), since that would lose data).

I would like to be able to force rows which have the same [date, name] values into a single row, if they can be successfully converged based on certain values being NaN (similar to the behavior of combine_first()). E.g., the above would end up as

date, name, value1, value2, value3, value4

If, for some column, two rows have different values and neither is NaN, the rows should not be converged (this would probably be an error that I would need to follow up on).

(To extend the above example, there may in fact be an arbitrary number of lines--given an arbitrary number of columns--which should be able to be converged into one single line.)
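The example rows did not survive in this copy, so here is a hypothetical dataset illustrating the situation (column and index names match the question; the values are invented): two rows share the same (date, name) but have complementary NaNs, so they could be converged into one row.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: the first two rows duplicate the
# (date, name) pair but hold non-overlapping values.
data = pd.DataFrame({
    'date':   ['2013-06-01', '2013-06-01', '2013-06-02'],
    'name':   ['ben', 'ben', 'alex'],
    'value1': [1.0, np.nan, 3.0],
    'value2': [np.nan, 2.0, 4.0],
})
print(data)
# The first two rows should converge to a single row with
# value1=1.0 and value2=2.0.
```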

This feels like a problem that should be very solvable via pandas, but I am having trouble figuring out an elegant solution.

Answer Source

Let's imagine you have some function combine_it that, given a set of rows that would have duplicate values, returns a single row. First, group by date and name:

grouped = data.groupby(['date', 'name'])

Then just apply the aggregation function and boom you're done:

result = grouped.agg(combine_it)

You can also provide different aggregation functions for different columns by passing agg a dict.
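The answer leaves combine_it unspecified; here is one possible implementation (a sketch, not the answerer's code): for each column within a group, keep the single non-NaN value, and raise if the group contains conflicting non-NaN values, matching the error behavior the question asks for.

```python
import numpy as np
import pandas as pd

def combine_it(series):
    """Collapse a group's column to its single non-NaN value.

    Raises ValueError if the group holds more than one distinct
    non-NaN value, since such rows should not be converged.
    """
    non_null = series.dropna().unique()
    if len(non_null) > 1:
        raise ValueError('conflicting values: %r' % (non_null,))
    return non_null[0] if len(non_null) else np.nan

data = pd.DataFrame({
    'date':   ['2013-06-01', '2013-06-01'],
    'name':   ['ben', 'ben'],
    'value1': [1.0, np.nan],
    'value2': [np.nan, 2.0],
})

result = data.groupby(['date', 'name']).agg(combine_it)
print(result)
```

After this, result has one row per (date, name) and can serve directly as the desired index. To use different aggregation functions per column, pass a dict instead, e.g. `grouped.agg({'value1': combine_it, 'value2': 'max'})`.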
