BENAMARA ILAN BENAMARA ILAN - 6 months ago 7
Python Question

Transpose a Python Dataframe according many variables

I have a Dataframe from one file, i have products and price together. I have on file per day who registered all the products bought by customer. So the length of columns depends of the biggest number of product bought by One client. On the start, i have a file like that :

date conv product Prices
01/2016 'part ' A|B|C|E|F 15|20|30|40|50
01/2016 'Pro' D|B 10|10


Then i split that file by "|", and after that i have 5 columns on my new df. beacause the biggest number of product bought by One client for that day is equal 5.

The Final DataFrame give :

Date Conv Product_1 product_2 ... product_n price_1 price_2 ... price_n
01/2016 'Part' A B C 15 20 30
01/2016 'Pro' B D C 10 10 20
02/2016 'Part' E A B 25 5 10


I would like to transpose Variables "Product_1 ...product_n" and "price 1 ... price_n". And obtain a new df :

Date Conv Product price
01/2016 'Part' A 15
01/2016 'Part' B 20
01/2016 'Part' C 30
01/2016 'Pro' B 10
01/2016 'Pro' D 10
01/2016 'Pro' C 20
02/2016 'Part' E 25
02/2016 'Part' A 5
02/2016 'Part' B 10


The difficulty is on transpose variable and copy the variable Date and conv.

I think with
SAS
we can obtain that with the code :

Proc transpose ;
Data = DF;
VAR product_1-product_4 price1_price_4;
BY Date Conv;
COPY Date Conv;


But on Python I don't find an equivalent.

Someone Know how can I do please ?

I try with :
df.transpose


But it's not the result that I want.

Answer

You can first select columns by list comprehension with startswith and then use pd.lreshape:

prods = ([col for col in df.columns if col.startswith('product_')])
prices = ([col for col in df.columns if col.startswith('price_')])

print (prods)
['product_1', 'product_2', 'product_n']
print (prices)
['price_1', 'price_2', 'price_n']

df1 = pd.lreshape(df, {'product' : prods, 'price' : prices}) 
print (df1)
     Conv     Date  price product
0  'Part'  01/2016     15       A
1   'Pro'  01/2016     10       B
2  'Part'  02/2016     25       E
3  'Part'  01/2016     20       B
4   'Pro'  01/2016     10       D
5  'Part'  02/2016      5       A
6  'Part'  01/2016     30       C
7   'Pro'  01/2016     20       C
8  'Part'  02/2016     10       B

EDIT by more specify question:

#new df1 from column product
df1 = (df['product'].str.split('|', expand=True))
#add prod_ to column names
prods = df1.columns = ['prod_' + str(col) for col in df1.columns] 

#new df2 from column Prices
df2 = (df['Prices'].str.split('|', expand=True))
#add part_ to column names
prices = df2.columns = ['part_' + str(col) for col in df2.columns]

#join all together
df3 = (pd.concat([df[['date','conv']], df1, df2], axis=1))

#reshape
print (pd.lreshape(df3, {'product' : prods, 'price' : prices}))
     conv     date price product
0  'part'  01/2016    15       A
1   'pro'  01/2016    10       D
2  'part'  01/2016    20       B
3   'pro'  01/2016    10       B
4  'part'  01/2016    30       C
5  'part'  01/2016    40       E
6  'part'  01/2016    50       F

Another solution with join:

#create dataframe and stack, drop level of multiindex
s1 = (df['product'].str.split('|', expand=True)).stack()
s1.index = s1.index.droplevel(-1)
s1.name = 'product'

s2 = (df['Prices'].str.split('|', expand=True)).stack()
s2.index = s2.index.droplevel(-1)
s2.name = 'price'

#remove original columns    
df = df.drop(['product','Prices'], axis=1)

#join series to dataframe    
df1 = (df.join(s1).reset_index(drop=True))
df2 = (df.join(s2).reset_index(drop=True))

#join all togehter
print (pd.concat([df1, df2[['price']]], axis=1))
      date    conv product price
0  01/2016  'part'       A    15
1  01/2016  'part'       B    20
2  01/2016  'part'       C    30
3  01/2016  'part'       E    40
4  01/2016  'part'       F    50
5  01/2016   'pro'       D    10
6  01/2016   'pro'       B    10

Timings:

In [598]: %timeit (a(df))
100 loops, best of 3: 10.6 ms per loop

In [599]: %timeit (b(df_a))
100 loops, best of 3: 14.1 ms per loop

Code for timings:

import pandas as pd

df = pd.DataFrame({'date': {0: '01/2016', 1: '01/2016'}, 
                   'conv': {0: "'part'", 1: "'pro'"}, 
                   'Prices': {0: '15|20|30|40|50', 1: '10|10'}, 
                   'product': {0: 'A|B|C|E|F', 1: 'D|B'}}, 
                    columns =['date','conv','product','Prices'])
df = pd.concat([df]*1000).reset_index(drop=True)

print (df)
df_a = df.copy()

def a(df):
    df1 = (df['product'].str.split('|', expand=True))
    prods = df1.columns = ['prod_' + str(col) for col in df1.columns] 

    df2 = (df['Prices'].str.split('|', expand=True))
    prices = df2.columns = ['part_' + str(col) for col in df2.columns]

    df3 = (pd.concat([df[['date','conv']], df1, df2], axis=1))

    return (pd.lreshape(df3, {'product' : prods, 'price' : prices}))


def b(df):
    s1 = (df['product'].str.split('|', expand=True)).stack()
    s1.index = s1.index.droplevel(-1)
    s1.name = 'product'

    s2 = (df['Prices'].str.split('|', expand=True)).stack()
    s2.index = s2.index.droplevel(-1)
    s2.name = 'price'

    df = df.drop(['product','Prices'], axis=1)

    df1 = (df.join(s1).reset_index(drop=True))
    df2 = (df.join(s2).reset_index(drop=True))

    return (pd.concat([df1, df2[['price']]], axis=1))

print (a(df))    
print (b(df_a))