user3313834 - 4 months ago 8
Python Question

python pandas from 0/1 dataframe to an itemset list

What is the most efficient way to go from a 0/1 pandas/numpy dataframe of this form::

>>> dd
{'a': {0: 1, 1: 0, 2: 1, 3: 0, 4: 1, 5: 1},
'b': {0: 1, 1: 1, 2: 0, 3: 0, 4: 1, 5: 1},
'c': {0: 0, 1: 1, 2: 1, 3: 0, 4: 1, 5: 1},
'd': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1},
'e': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0}}
>>> df = pd.DataFrame(dd)
>>> df
a b c d e
0 1 1 0 0 0
1 0 1 1 1 0
2 1 0 1 1 1
3 0 0 0 1 0
4 1 1 1 0 0
5 1 1 1 1 0
>>>


to an itemset list of lists?::

itemset = [['a', 'b'],
['b', 'c', 'd'],
['a', 'c', 'd', 'e'],
['d'],
['a', 'b', 'c'],
['a', 'b', 'c', 'd']]


The real df.shape is roughly (1e6, 500).

Answer

You can first multiply the values by the column names with mul, then convert the DataFrame to a numpy array with values:

print (df.mul(df.columns.to_series()).values)
[['a' 'b' '' '' '']
 ['' 'b' 'c' 'd' '']
 ['a' '' 'c' 'd' 'e']
 ['' '' '' 'd' '']
 ['a' 'b' 'c' '' '']
 ['a' 'b' 'c' 'd' '']]
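
Why the multiplication produces this output: in Python, a string repeated 1 time is the string itself and repeated 0 times is the empty string. A minimal plain-Python sketch of that idea for a single row (the names cols, row and labelled are only for illustration and are not part of the answer above):

```python
# Column names and one 0/1 row (row 2 of the example frame).
cols = ['a', 'b', 'c', 'd', 'e']
row = [1, 0, 1, 1, 1]

# 'a' * 1 == 'a' and 'b' * 0 == '', so each flag either keeps
# or blanks out the corresponding column name.
labelled = [name * flag for name, flag in zip(cols, row)]
print(labelled)
```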

Then remove the empty strings with a nested list comprehension:

print ([[y for y in x if y != ''] for x in df.mul(df.columns.to_series()).values])
[['a', 'b'], 
 ['b', 'c', 'd'],
 ['a', 'c', 'd', 'e'], 
 ['d'], 
 ['a', 'b', 'c'], 
 ['a', 'b', 'c', 'd']]
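
For a frame as large as the one in the question, a possible alternative is to skip the intermediate string array and index the column names with a boolean mask for each row. This is only a sketch and has not been benchmarked on data of that size. It assumes every value is exactly 0 or 1. The names cols and mask are illustrative.

```python
import pandas as pd

dd = {'a': {0: 1, 1: 0, 2: 1, 3: 0, 4: 1, 5: 1},
      'b': {0: 1, 1: 1, 2: 0, 3: 0, 4: 1, 5: 1},
      'c': {0: 0, 1: 1, 2: 1, 3: 0, 4: 1, 5: 1},
      'd': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1},
      'e': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0}}
df = pd.DataFrame(dd)

# Column labels as a numpy array so they can be indexed with a boolean mask.
cols = df.columns.to_numpy()

# 0/1 values as a boolean array: True where the item is present.
mask = df.to_numpy().astype(bool)

# For each row, keep only the column names whose flag is True.
itemset = [cols[row].tolist() for row in mask]
print(itemset)
```

The per-row loop still runs in Python, so with about 1e6 rows it is worth timing both approaches on a sample before choosing one.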