Evgeni Evgeni - 1 year ago 79
Python Question

Python Pandas aggregating values

I have a dataframe like this:

0 A\nA\nA
1 na\nB|D|E|F|G|H\nB|D|E|F|G|H
2 B\nB|c\nB
3 na\nna\nna

And I would like to aggregate the values by highest count :

0 A
1 B|D|E|F|G|H
2 B
3 na

I assume I should delimit the column first by /n so I am using

df = pd.DataFrame([ x.split('\n') for x in df.tolist()])

So I get:

0 1 2
0 A A A
1 na B|D|E|F|G|H B|D|E|F|G|H
2 B B|C B
3 na na na

How can I merge the columns next to get the desired output?


Answer Source

You can use Counter with most_common:

from collections import Counter

df = pd.DataFrame([Counter(x.split('\n')).most_common(1)[0][0] for x in df.tolist()])
print (df)
0            A
1  B|D|E|F|G|H
2            B
3           na

Another solution with str.split and apply value_counts:

df = df.str.split('\n', expand=True).apply(lambda x: pd.value_counts(x).index[0],axis=1)
print (df)
0              A
1    B|D|E|F|G|H
2              B
3             na
dtype: object


In [238]: %timeit (pd.DataFrame([Counter(x.split('\n')).most_common(1)[0][0] for x in df.tolist()]))
1000 loops, best of 3: 197 ┬Ás per loop

In [239]: %timeit (df.str.split('\n', expand=True).apply(lambda x: pd.value_counts(x).index[0],axis=1))
100 loops, best of 3: 2.33 ms per loop

In [241]: %timeit (pd.DataFrame([ x.split('\n') for x in df.tolist()]).mode(1))
100 loops, best of 3: 2.32 ms per loop
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download