David David - 2 months ago 8
Python Question

Pandas count frequencies within str series

Given a Pandas Series of type str, I want to get the frequencies of the result returned by str.split.

For example, given the Series

s = pd.Series(['abc,def,ghi','ghi,abc'])


I would like to get

abc: 2
def: 1
ghi: 2


as a result. How can I get this?

Edit: The solution should efficiently work with a large Series of 50 million rows.

Answer

Another pandas solution with str.split, sum and value_counts :

print pd.Series(s.str.split(',').sum()).value_counts()
abc    2
ghi    2
def    1
dtype: int64

EDIT:

More efficent methods:

import pandas as pd
s = pd.Series(['abc,def,ghi','ghi,abc'])
s = pd.concat([s]*10000).reset_index(drop=True)

In [17]: %timeit pd.Series(s.str.split(',').sum()).value_counts()
1 loops, best of 3: 3.1 s per loop

In [18]: %timeit s.str.split(',', expand=True).stack().value_counts()
10 loops, best of 3: 46.2 ms per loop

In [19]: %timeit pd.DataFrame([ x.split(',') for x in s.tolist() ]).stack().value_counts()
10 loops, best of 3: 22.2 ms per loop

In [20]: %timeit pd.Series([item for sublist in [ x.split(',') for x in s.tolist() ] for item in sublist]).value_counts()
100 loops, best of 3: 16.6 ms per loop