brittenb brittenb - 1 month ago 9
Python Question

Filter pandas DataFrame by string length within group

Let's say I have the following data

import pandas as pd
df = pd.DataFrame(data=[[1, 'a'], [1, 'aaa'], [1, 'aa'],
[2, 'bb'], [2, 'bbb'],
[3, 'cc']],
columns=['key', 'text'])

key text
0 1 a
1 1 aaa
2 1 aa
3 2 bb
4 2 bbb
5 3 cc


What I would like to do is group by the
key
variable and sort the data within each group by the length of
text
and end up with a single
Series
of index values to use to reindex the dataframe. I thought I could just do something like this:

df.groupby('key').text.str.len().sort_values(ascending=False).index


But it said I need to use
apply
, so I tried this:

df.groupby('key').apply(lambda x: x.text.str.len().sort_values(ascending=False).index, axis=1)


But that told me that
lambda
got an unexpected keyword:
axis
.

I'm relatively new to pandas, so I'm not sure how to go about this. Also, my goal is to simply deduplicate the data such that for each
key
, I keep the value with the longest value of
text
. The expected output is:

key text
1 1 aaa
4 2 bbb
5 3 cc


If there's an easier way to do this than what I'm attempting, I'm open to that as well.

Answer Source
df.groupby('key', as_index=False).apply(lambda x: x[x.text.str.len() == x.text.str.len().max()])

Output:

     key text
0 1    1  aaa
1 4    2  bbb
2 5    3   cc