Nishant ranjan - 7 months ago
Python Question

Most occurring string in data using pandas - n-gram data mining

I have data in the following form, stored in a single column of a CSV file.

['hhcb', 'hcbc', 'cbcc', 'bccc', 'cccd', 'ccdd', 'cddh']
['fahb', 'ahba', 'hbac', 'bacc']
['hchc', 'chcb', 'hcbh']
['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hcdc', 'cdcc']
['habb', 'abbb', 'bbbb', 'bbbc', 'bbcc', 'bccd', 'ccdh', 'cdhd']


I need to find the most frequently occurring four-character string in this data.
Please suggest a way to do this.
(This is just an example; the original data is large.)

Answer

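You can use apply(pd.Series) to expand the lists into a DataFrame, then stack the result into a single Series and call value_counts. Because value_counts sorts by count in descending order, you can keep only the top values with head or a slice such as [:5]: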

print(df)
                                                  a
0        [hhcb, hcbc, cbcc, bccc, cccd, ccdd, cddh]
1                          [fahb, ahba, hbac, bacc]
2                                [hchc, chcb, hcbh]
3              [hhhh, hhhh, hhhc, hhcd, hcdc, cdcc]
4  [habb, abbb, bbbb, bbbc, bbcc, bccd, ccdh, cdhd]

print(df.a.apply(pd.Series).stack().value_counts()[:1])
hhhh    2
dtype: int64
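
For reference, here is a self-contained sketch of the same idea. It makes two assumptions that are not in the original post: that each CSV cell arrives as the text of a Python list (so it is parsed with ast.literal_eval), and that a pandas version with Series.explode (0.25 or later) is available, which flattens the lists without building the intermediate wide DataFrame. The variable names are only illustrative.

```python
import ast

import pandas as pd

# Assumed input: each line of the CSV column is the text of a Python list.
lines = [
    "['hhcb', 'hcbc', 'cbcc', 'bccc', 'cccd', 'ccdd', 'cddh']",
    "['fahb', 'ahba', 'hbac', 'bacc']",
    "['hchc', 'chcb', 'hcbh']",
    "['hhhh', 'hhhh', 'hhhc', 'hhcd', 'hcdc', 'cdcc']",
    "['habb', 'abbb', 'bbbb', 'bbbc', 'bbcc', 'bccd', 'ccdh', 'cdhd']",
]

# Turn each text cell into a real list of strings.
df = pd.DataFrame({"a": [ast.literal_eval(line) for line in lines]})

# One row per string, then count how often each string occurs overall.
counts = df["a"].explode().value_counts()

# value_counts sorts by count in descending order, so the first entry
# is the most frequent string.
print(counts.head(1))
```

With the sample data, 'hhhh' is the only string that occurs more than once, so it comes out on top with a count of 2.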

EDIT:

If you need the top values for each row instead, you can use:

top = df.a.apply(pd.Series).apply(lambda x: x.value_counts()[:2].index.tolist(), axis=1)
print(top)
0    [ccdd, hhcb]
1    [bacc, fahb]
2    [hcbh, chcb]
3    [hhhh, hhhc]
4    [bccd, abbb]
dtype: object
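
A possible alternative for the per-row case, sketched here with collections.Counter from the standard library instead of value_counts. The helper name top_n is hypothetical. Note that when several strings in a row are tied (most rows in the sample contain each string only once), Counter keeps them in first-seen order, so the tied entries can differ from the output shown above.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({"a": [
    ["hhcb", "hcbc", "cbcc", "bccc", "cccd", "ccdd", "cddh"],
    ["fahb", "ahba", "hbac", "bacc"],
    ["hchc", "chcb", "hcbh"],
    ["hhhh", "hhhh", "hhhc", "hhcd", "hcdc", "cdcc"],
    ["habb", "abbb", "bbbb", "bbbc", "bbcc", "bccd", "ccdh", "cdhd"],
]})


def top_n(strings, n=2):
    """Return the n most frequent strings in one row, most frequent first."""
    return [s for s, _ in Counter(strings).most_common(n)]


# Apply the helper to each list; the result is a Series of short lists.
top = df["a"].apply(top_n)
print(top)
```

This avoids expanding every row into a wide DataFrame, which may matter when the rows have very different lengths.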