
What is the fastest way to expand cell contents into column names with pandas?

My data is in the following format:

index  keys
1      key 1
2      key 1\nkey 2\nkey 3
3      key 2\nkey 4\nkey 5
5      key 2\nkey 3\nkey 5


The keywords that I am interested in are in the keys column, separated with \n. I want to transform each key into a column name and mark with True or 1 the row it came from, as follows:

index  key 1  key 2  key 3  key 4  key 5
1          1      0      0      0      0
2          1      1      1      0      0
3          0      1      0      1      1
5          0      1      1      0      1


Right now, I iterate over the index, build for each row a dictionary of its keys, and then re-import the resulting list of dictionaries with pandas.DataFrame, like this:

import pandas

l = []
for i in df.index:
    # one dict per row: {key: True} for every key found in the cell
    d = {j: True for j in df["keys"][i].split("\n")}
    l.append(d)
new_df = pandas.DataFrame(l)


It is pretty fast, but since there are a lot of NaN values, the memory consumption of the dataframe before I convert it to int64 is quite large, and I have tens of thousands of rows. Does anyone know a more efficient way to do this, ideally one that avoids iterating over the rows myself?
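
For context, a rough sketch of where that overhead comes from, continuing from the snippet above: rows missing a key get NaN, so each column mixes True with NaN and falls back to the object dtype until the fill-and-cast step.

new_df = pandas.DataFrame(l)
print(new_df.dtypes)                           # object columns: True mixed with NaN
print(new_df.memory_usage(deep=True).sum())    # heavy while still object dtype
new_df = new_df.fillna(False).astype("int64")  # the int64 cast mentioned above
print(new_df.memory_usage(deep=True).sum())    # much lighter after the cast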

Answer Source

Option 1
str.split + stack + str.get_dummies + groupby + sum:

df    
                      keys
index                     
1                    key 1
2      key 1\nkey 2\nkey 3
3      key 2\nkey 4\nkey 5
5      key 2\nkey 3\nkey 5
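
For reference, a minimal way to rebuild this frame (index name and values taken from the display above):

import pandas as pd

df = pd.DataFrame(
    {'keys': ['key 1',
              'key 1\nkey 2\nkey 3',
              'key 2\nkey 4\nkey 5',
              'key 2\nkey 3\nkey 5']},
    index=pd.Index([1, 2, 3, 5], name='index'),
)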

df['keys'].str.split('\n', expand=True).stack()\
                 .str.get_dummies().groupby(level=0).sum()

       key 1  key 2  key 3  key 4  key 5
index                                   
1          1      0      0      0      0
2          1      1      1      0      0
3          0      1      0      1      1
5          0      1      1      0      1
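
The same pipeline unrolled, to show what each step contributes (intermediate names are only illustrative):

long = df['keys'].str.split('\n', expand=True).stack()  # one key per row, MultiIndex (index, position)
dummies = long.str.get_dummies()                        # one indicator column per distinct key
result = dummies.groupby(level=0).sum()                 # collapse positions back to one row per index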

Option 2
Using pd.get_dummies, bypassing stack:

d = pd.get_dummies(df['keys'].str.split('\n', expand=True))
d.groupby(d.columns.str.split('_').str[1], axis=1).sum()

       key 1  key 2  key 3  key 4  key 5
index                                   
1          1      0      0      0      0
2          1      1      1      0      0
3          0      1      0      1      1
5          0      1      1      0      1
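
Here pd.get_dummies prefixes each dummy column with the position it came from ('0_key 1', '1_key 2', ...), which is why the column-wise groupby splits on '_' to recover the key names. On pandas 2.x, where the axis=1 argument to groupby is deprecated, an equivalent spelling (a sketch, not part of the original answer) is:

d = pd.get_dummies(df['keys'].str.split('\n', expand=True))
d.columns = d.columns.str.split('_').str[1]  # strip the positional prefix: '0_key 1' -> 'key 1'
d.T.groupby(level=0).sum().T                 # sum duplicate key columns without axis=1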

Option 3
Using stack + value_counts + unstack:

df['keys'].str.split('\n', expand=True).stack()\
       .groupby(level=0).value_counts().unstack().fillna(0)

       key 1  key 2  key 3  key 4  key 5
index                                   
1        1.0    0.0    0.0    0.0    0.0
2        1.0    1.0    1.0    0.0    0.0
3        0.0    1.0    0.0    1.0    1.0
5        0.0    1.0    1.0    0.0    1.0
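
Because unstack introduces NaN, this result comes back as float; chaining .astype(int) at the end (a small addition, not in the original answer) restores the integer output of the other options:

df['keys'].str.split('\n', expand=True).stack()\
       .groupby(level=0).value_counts().unstack()\
       .fillna(0).astype(int)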