Satrio Adi Prabowo - 3 years ago
Python Question

How to make a list of lists from a pandas DataFrame?

I have a pandas DataFrame with words and tags:

    words tags
0       I   WW
1      am   XX
2  newbie   YY
3       .   ZZ
4     You   WW
5     are   XX
6    cool   YY
7       .   ZZ


Is there a way to create a list of lists from the DataFrame, like this:

[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.','ZZ')],
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.','ZZ')]]


It is a list of lists of tuples. The inner lists are delimited by ('.','ZZ'), so each inner list is one sentence.

I can iterate over the rows of the DataFrame, build a list, and append to it whenever the condition is true, but is there a more 'pandas' way to solve it?

Answer

If performance is important, first create tuples from all the values and then split them into sublists:

from itertools import groupby

L = list(zip(df['words'], df['tags']))
print (L)
[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), 
 ('.', 'ZZ'), ('You', 'WW'), ('are', 'XX'), 
 ('cool', 'YY'), ('.', 'ZZ')]

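sep = ('.','ZZ')
# keep the non-separator runs and re-append sep to each one
# (assumes every sentence ends with exactly one sep)
new_L = [list(g) + [sep] for k, g in groupby(L, lambda x: x==sep) if not k]
print (new_L)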

[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')], 
 [('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
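
To make the grouping step easier to follow, here is a minimal sketch that runs on a plain list instead of a DataFrame. The names pairs, runs and sentences are only illustrative; pairs stands in for list(zip(df['words'], df['tags'])).

```python
from itertools import groupby

# Plain (word, tag) pairs standing in for list(zip(df['words'], df['tags']))
pairs = [('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ'),
         ('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]
sep = ('.', 'ZZ')

# groupby yields (key, run) for each maximal run of consecutive items that
# share a key; the key is True for separator runs, False for sentence bodies.
runs = [(is_sep, list(run)) for is_sep, run in groupby(pairs, lambda x: x == sep)]

# Keep only the sentence bodies and put the separator back at the end of each.
sentences = [body + [sep] for is_sep, body in runs if not is_sep]
print(sentences)
```

Because the separator is re-appended unconditionally, this approach assumes the data ends with a separator and that separators never appear twice in a row. If either can happen, an explicit loop is probably safer.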

Timings:

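import numpy as np
import pandas as pd

# repeat the sample 1000 times (8000 rows) so the timings are measurable
df = pd.concat([df]*1000).reset_index(drop=True)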

def zero(df):
    dft = df.apply(tuple, 1)
    return ([x.values.tolist() for _, x in dft.groupby((dft == ('.', 'ZZ')).shift().cumsum().bfill())])

In [55]: %timeit ([list(g) + [('.','ZZ')] for k, g in groupby(list(zip(df['words'], df['tags'])), lambda x: x==('.','ZZ')) if not k] )
100 loops, best of 3: 4.14 ms per loop

def pir(df):
    v = df.values
    return ([list(map(tuple, x)) for x in np.split(v, np.where((v == ['.', 'ZZ']).all(1)[:-1])[0] + 1)])

In [68]: %timeit (pir(df))
10 loops, best of 3: 21.9 ms per loop


In [56]: %timeit (zero(df))
1 loop, best of 3: 328 ms per loop

In [57]: %timeit (df.groupby((df.shift().values == ['.', 'ZZ']).all(axis=1).cumsum()).apply(lambda group: list(zip(group['words'], group['tags']))).values.tolist())
1 loop, best of 3: 286 ms per loop

In [58]: %timeit (list(filter(None,[i.apply(tuple,1).values.tolist() for i in np.array_split(df,df[(df['words'] == '.') & (df['tags'] == 'ZZ')].index+1)])))
1 loop, best of 3: 1.31 s per loop
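
The pandas-only variants timed above all rely on the same idea: flag the separator rows, shift the flag down by one row, and take a cumulative sum to get one label per sentence. A small sketch of just that step, rebuilt on the sample data from the question (is_sep and sentence_id are illustrative names, and shift(fill_value=...) assumes a reasonably recent pandas):

```python
import pandas as pd

df = pd.DataFrame({
    'words': ['I', 'am', 'newbie', '.', 'You', 'are', 'cool', '.'],
    'tags':  ['WW', 'XX', 'YY', 'ZZ', 'WW', 'XX', 'YY', 'ZZ'],
})

# True on every row that is the sentence separator ('.', 'ZZ').
is_sep = (df['words'] == '.') & (df['tags'] == 'ZZ')

# Shift by one row so the flag marks the row *after* a separator, then take
# a running total: all rows of one sentence end up with the same integer.
sentence_id = is_sep.shift(fill_value=False).cumsum()

# Group on that label and turn each group into a list of (word, tag) tuples.
result = [list(zip(g['words'], g['tags'])) for _, g in df.groupby(sentence_id)]
print(result)
```

This is readable, but as the timings suggest, groupby on a DataFrame carries noticeable overhead compared with working on a plain list of tuples.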

I asked a separate question about splitting the list into sublists; these are the solutions from it:

def jez_coldspeed(df):
    L = list(zip(df['words'], df['tags']))
    L2 = []
    for i in L[::-1]:
        if i == ('.','ZZ'):
            L2.append([])

        L2[-1].append(i)

    return [x[::-1] for x in L2[::-1]]

def jez_coldspeed1(df):
    L = list(zip(df['words'], df['tags']))
    L2 = []
    sep = ('.','ZZ')
    for i in reversed(L):
        if i == sep:
            L2.append([])

        L2[-1].append(i)

    return [x[::-1] for x in reversed(L2)]


In [74]: %timeit (jez_coldspeed(df))
100 loops, best of 3: 2.96 ms per loop

In [75]: %timeit (jez_coldspeed1(df))
100 loops, best of 3: 2.95 ms per loop
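
Both jez_coldspeed and jez_coldspeed1 walk the list backwards so that each separator opens a new sublist, which means they fail with an IndexError if the data does not end with ('.','ZZ'). As an alternative sketch (not one of the timed solutions, and split_forward is a made-up name), a forward pass can close a sublist at each separator and keep any unterminated tail:

```python
def split_forward(pairs, sep):
    """Split pairs into sublists, closing a sublist after every sep."""
    sentences = []
    current = []
    for pair in pairs:
        current.append(pair)
        if pair == sep:
            # A separator closes the sentence being built.
            sentences.append(current)
            current = []
    if current:
        # Trailing items with no closing separator form a last, open sublist.
        sentences.append(current)
    return sentences


pairs = [('I', 'WW'), ('am', 'XX'), ('.', 'ZZ'), ('You', 'WW')]
print(split_forward(pairs, ('.', 'ZZ')))
```

It has not been timed against the functions above, so treat it as a robustness variant rather than a faster one.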

def jez_theBuzzyCoder(df):
    L = list(zip(df['words'], df['tags']))
    a = list()
    start = 0
    sep = ('.', 'ZZ')

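    # list.index raises ValueError when sep is missing (it never returns -1),
    # so catch the exception instead of comparing against -1
    while start < len(L):
        try:
            end = L.index(sep, start) + 1
        except ValueError:
            # no closing separator: keep the remaining items as a last sublist
            end = len(L)
        a.append(L[start:end])
        start = end
    return a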


print (jez_theBuzzyCoder(df))

In [81]: %timeit (jez_theBuzzyCoder(df))
100 loops, best of 3: 3.16 ms per loop
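
The %timeit lines above need IPython. For anyone who wants to rerun the comparison in a plain script, here is a rough sketch using the standard timeit module. It works on a plain list (so the zip step is not included), the function names are placeholders, and the absolute numbers will differ by machine and Python version.

```python
import timeit
from itertools import groupby

SEP = ('.', 'ZZ')
# Plain-list stand-in for list(zip(df['words'], df['tags'])) after the
# sample DataFrame has been repeated 1000 times.
PAIRS = [('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), SEP,
         ('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), SEP] * 1000


def by_groupby(L):
    return [list(g) + [SEP] for k, g in groupby(L, lambda x: x == SEP) if not k]


def by_reverse(L):
    out = []
    for item in reversed(L):
        if item == SEP:
            out.append([])
        out[-1].append(item)
    return [s[::-1] for s in reversed(out)]


def by_index(L):
    # Assumes L ends with SEP; otherwise list.index raises ValueError.
    out, start = [], 0
    while start < len(L):
        end = L.index(SEP, start) + 1
        out.append(L[start:end])
        start = end
    return out


candidates = [by_groupby, by_reverse, by_index]

# All three must agree before their speeds are worth comparing.
expected = by_groupby(PAIRS)
assert all(f(PAIRS) == expected for f in candidates)

# Total seconds for 10 calls of each function.
timings = {f.__name__: timeit.timeit(lambda f=f: f(PAIRS), number=10)
           for f in candidates}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds / 10 * 1000:.2f} ms per call")
```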