Nishant ranjan - 2 years ago
Python Question

Most occurring string in data using pandas - n-gram data mining

I have data in the following form, stored in a single column of a CSV file.

['hhcb', 'hcbc', 'cbcc', 'bccc', 'cccd', 'ccdd', 'cddh']
['fahb', 'ahba', 'hbac', 'bacc']
['hchc', 'chcb', 'hcbh']
['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hcdc', 'cdcc']
['habb', 'abbb', 'bbbb', 'bbbc', 'bbcc', 'bccd', 'ccdh', 'cdhd']

I have to find the most frequently occurring four-character string in this data.
Please suggest a way to do this.
(This is only an example; the original data is large.)

Answer

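You can apply pd.Series to the column to expand each list into a DataFrame, then use stack and value_counts. Because value_counts sorts by count in descending order, you can then keep only the top values with head or a slice such as [:5]: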

print(df)
0        [hhcb, hcbc, cbcc, bccc, cccd, ccdd, cddh]
1                          [fahb, ahba, hbac, bacc]
2                                [hchc, chcb, hcbh]
3              [hhhh, hhhh, hhhc, hhcd, hcdc, cdcc]
4  [habb, abbb, bbbb, bbbc, bbcc, bccd, ccdh, cdhd]

print(df.a.apply(pd.Series).stack().value_counts()[:1])
hhhh    2
dtype: int64
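
Since the question says the data comes from a CSV file, the cells will probably arrive as strings rather than real lists. The following is only a sketch under that assumption: the CSV text and the column name a are made up to mirror the sample, ast.literal_eval does the parsing, and Series.explode (available in pandas 0.25 and later) is swapped in for apply(pd.Series).stack():

```python
import ast
import io

import pandas as pd

# Hypothetical CSV content: one column named "a" whose cells hold the
# string form of a Python list (an assumption mirroring the question).
csv_text = '''a
"['hhcb', 'hcbc', 'cbcc', 'bccc', 'cccd', 'ccdd', 'cddh']"
"['fahb', 'ahba', 'hbac', 'bacc']"
"['hchc', 'chcb', 'hcbh']"
"['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hcdc', 'cdcc']"
"['habb', 'abbb', 'bbbb', 'bbbc', 'bbcc', 'bccd', 'ccdh', 'cdhd']"
'''

df = pd.read_csv(io.StringIO(csv_text))

# Cells are read as strings; literal_eval turns each one into a real list.
df["a"] = df["a"].apply(ast.literal_eval)

# Put one n-gram on each row, then count occurrences over the whole column.
counts = df["a"].explode().value_counts()

# value_counts sorts by count, descending, so the first entry is the top one.
print(counts.head(1))
```

If the column already holds real lists, the literal_eval step can be skipped. On large data, explode avoids building the wide intermediate DataFrame that apply(pd.Series) creates, which tends to be slow.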


If you need the top values for each row instead, you can use:

top = df.a.apply(pd.Series).apply(lambda x: x.value_counts()[:2].index.tolist(), axis=1)
print(top)
0    [ccdd, hhcb]
1    [bacc, fahb]
2    [hcbh, chcb]
3    [hhhh, hhhc]
4    [bccd, abbb]
dtype: object
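
As an alternative for the per-row case, here is a minimal sketch using collections.Counter from the standard library. The helper name top_n is hypothetical, and the DataFrame is rebuilt from the sample in the question. Counter.most_common orders tied counts by first occurrence (Python 3.7 and later), so the result for ties may differ from the value_counts output shown above:

```python
from collections import Counter

import pandas as pd

# Sample data from the question, stored as real lists in column "a".
df = pd.DataFrame({"a": [
    ["hhcb", "hcbc", "cbcc", "bccc", "cccd", "ccdd", "cddh"],
    ["fahb", "ahba", "hbac", "bacc"],
    ["hchc", "chcb", "hcbh"],
    ["hhhh", "hhhh", "hhhc", "hhcd", "hcdc", "cdcc"],
    ["habb", "abbb", "bbbb", "bbbc", "bbcc", "bccd", "ccdh", "cdhd"],
]})


def top_n(ngrams, n=2):
    """Return the n most frequent strings in one row's list (hypothetical helper)."""
    return [gram for gram, _ in Counter(ngrams).most_common(n)]


# Apply the helper to each row's list; the result is a Series of lists.
top = df["a"].apply(top_n)
print(top)
```

This avoids expanding each row into columns, at the cost of calling a Python function once per row.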