Noobie Noobie - 4 months ago 14
Python Question

Pandas : how to get the unique number of values in cells when cells contain lists?

For some mysterious reason I have a dataframe that looks like

index col_weird col_normal
2012-01-01 14:30 ['A','B'] 2
2012-01-01 14:32 ['A','C','D'] 4
2012-01-01 14:36 ['C','D'] 2
2012-01-01 14:39 ['E','B'] 4
2012-01-01 14:40 ['G','H'] 2


I would like to resample my dataframe every 5 minutes, and


  • get the unique number of elements across all the lists in
    col_weird
    ,

  • get the mean of
    col_normal



Of course, using
resample().col_weird.nunique()
would fail for the first task because I want the unique number of elements: that is, between
14:30
and
14:35
I expect this number to be 4, corresponding to A,B,C,D.

Over the same period, the mean of
col_normal
is of course 3.

Any idea how to get that?

Thanks!

Answer

I think you can expand list to Series first:

df = df['col'].apply(pd.Series).stack().reset_index(drop=True, level=1)
print (df)
2012-01-01 14:30    A
2012-01-01 14:30    B
2012-01-01 14:32    A
2012-01-01 14:32    C
2012-01-01 14:32    D
2012-01-01 14:36    C
2012-01-01 14:36    D
2012-01-01 14:39    E
2012-01-01 14:39    B
2012-01-01 14:40    G
2012-01-01 14:40    H
dtype: object

Then use resample:

df = df.resample('1H').nunique()
print (df)
2012-01-01 14:00:00    7
Freq: H, dtype: int64