Nyitrai Lőrinc Nyitrai Lőrinc - 3 months ago 9
Python Question

normalizing data by duplication

note: this question is indeed a duplicate of Split pandas dataframe string entry to separate rows, but the answer provided here is more generic and informative, so with all respect due, I chose not to delete the thread




I have a 'dataset' with the following format:

id | value | ...
--------|-------|------
a | 156 | ...
b,c | 457 | ...
e,g,f,h | 346 | ...
... | ... | ...


and I would like to normalize it by duplicating all values for each ids:

id | value | ...
--------|-------|------
a | 156 | ...
b | 457 | ...
c | 457 | ...
e | 346 | ...
g | 346 | ...
f | 346 | ...
h | 346 | ...
... | ... | ...


What I'm doing is applying the split-apply-combine principle of
pandas
using
.groupby
that creates a
tuple
for each group
(groupby value, pd.DataFrame())


I created a column to group by that simply counts the ids in the row:

df['count_ids'] = df['id'].str.split(',').apply(lambda x: len(x))

id | value | count_ids
--------|-------|------
a | 156 | 1
b,c | 457 | 2
e,g,f,h | 346 | 4
... | ... | ...


The way I'm duplicating the rows is as follows:

pd.DataFrame().append([group]*count_ids)


I'm slowly progressing, but it is really complex, and I would appreciate any best practice or recommendation you can share with this type of problems.

Answer

try this:

In [44]: df
Out[44]:
        id  value
0        a    156
1      b,c    457
2  e,g,f,h    346

In [45]: (df['id'].str.split(',', expand=True)
   ....:          .stack()
   ....:          .reset_index(level=0)
   ....:          .set_index('level_0')
   ....:          .rename(columns={0:'id'})
   ....:          .join(df.drop('id',1), how='left')
   ....: )
Out[45]:
  id  value
0  a    156
1  b    457
1  c    457
2  e    346
2  g    346
2  f    346
2  h    346

Explanation:

In [48]: df['id'].str.split(',', expand=True).stack()
Out[48]:
0  0    a
1  0    b
   1    c
2  0    e
   1    g
   2    f
   3    h
dtype: object

In [49]: df['id'].str.split(',', expand=True).stack().reset_index(level=0)
Out[49]:
   level_0  0
0        0  a
0        1  b
1        1  c
0        2  e
1        2  g
2        2  f
3        2  h

In [50]: df['id'].str.split(',', expand=True).stack().reset_index(level=0).set_index('level_0')
Out[50]:
         0
level_0
0        a
1        b
1        c
2        e
2        g
2        f
2        h

In [51]: df['id'].str.split(',', expand=True).stack().reset_index(level=0).set_index('level_0').rename(columns={0:'id'})
Out[51]:
        id
level_0
0        a
1        b
1        c
2        e
2        g
2        f
2        h

In [52]: df.drop('id',1)
Out[52]:
   value
0    156
1    457
2    346
Comments