My data is in the following format:
index keys
1 key 1
2 key 1\nkey 2\nkey 3
3 key 2\nkey 4\nkey 5
5 key 2\nkey 3\nkey 5
\n
True
1
index key 1 key 2 key 3 key 4 key 5
1 1 0 0 0 0
2 1 1 1 0 0
3 0 1 0 1 1
5 0 1 1 0 1
pandas.DataFrame.from_dict
l=[]
for i in df.index:
d={j:True for j in df["keys"][i].split("\n")}
l.append(d)
new_df=pandas.DataFrame(l)
NaN
int64
Option 1
str.split
+ stack
+ str.get_dummies
+ groupby
+ sum
:
df
keys
index
1 key 1
2 key 1\nkey 2\nkey 3
3 key 2\nkey 4\nkey 5
5 key 2\nkey 3\nkey 5
df['keys'].str.split('\n', expand=True).stack()\
.str.get_dummies().groupby(level=0).sum()
key 1 key 2 key 3 key 4 key 5
index
1 1 0 0 0 0
2 1 1 1 0 0
3 0 1 0 1 1
5 0 1 1 0 1
Option 2
Using pd.get_dummies
, bypassing stack
:
d = pd.get_dummies(df['keys'].str.split('\n', expand=True))
d.groupby(d.columns.str.split('_').str[1], axis=1).sum()
key 1 key 2 key 3 key 4 key 5
index
1 1 0 0 0 0
2 1 1 1 0 0
3 0 1 0 1 1
5 0 1 1 0 1
Option 3
Using stack
+ value_counts
+ unstack
:
df['keys'].str.split('\n', expand=True).stack()\
.groupby(level=0).value_counts().unstack().fillna(0)
key 1 key 2 key 3 key 4 key 5
index
1 1.0 0.0 0.0 0.0 0.0
2 1.0 1.0 1.0 0.0 0.0
3 0.0 1.0 0.0 1.0 1.0
5 0.0 1.0 1.0 0.0 1.0