flyingmeatball flyingmeatball - 6 months ago 32
Python Question

pandas condensing multiple columns for same value

I'm trying to integrate multiple data sources and I'm finding it a bit tricky to concisely combine my columns. I have a dataframe that looks something like this:

df = pd.DataFrame([['Address Data','City data','State Data', 'Zip Data', np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
[ np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
[ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data']],
columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])

print df

Address1 City1 State1 Zip1 Address2 City2/
0 Address Data City Data State Data Zip Data NaN NaN
1 NaN NaN NaN NaN Address Data City Data
2 NaN NaN NaN NaN NaN NaN

State2 Zip2 Address3 City3 State3 Zip3
0 NaN NaN NaN NaN NaN NaN
1 State Data Zip Data NaN NaN NaN NaN
2 NaN NaN Address Data City Data State Data Zip Data


I'm trying to combine to something like this:

#run code here to combine multiple versions of the columns

print df

Address City State Zip
0 Address Data City data State Data Zip Data
1 Address Data City data State Data Zip Data
2 Address Data City data State Data Zip Data


I was hoping I could use lines like this to set multiple columns at once:

df.loc[df['State1'].notnull(),['Address','city','state','State','Zip']] = df.loc[df['State1'].notnull(),['Address1','City1','State1','Zip1']].values


However, .loc doesn't allow you to create multiple columns at once. Any thoughts on how to combine the data would be much appreciated!

Edit based on solution below

The problem with that solution is my data may actually look more like this:

df = pd.DataFrame([['Address Data','City data','State Data', 'Zip Data', 'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
[ np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data',np.nan,np.nan,np.nan,np.nan],
[ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data','City data','State Data', 'Zip Data']],
columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])


In this case, the soltuion below gives me 4 rows, when I only want the original 3. What I want to say is "if section 1 has data, use section 1, else if section 2 has data use section 2, else if section 3 has data, use section 3". Each row needs to remain unique, because there are many other attributes of a row that I am not consolidating. Thanks!

Answer

I think solution is totally different, so I decide create new answer:

import pandas as pd
import numpy as np

#random dataframe
np.random.seed(1)
df1 = pd.DataFrame(np.random.randint(10, size=(3,9)))
df1.columns = ['a1','b1','c1','a2','b2','c2','a3','b3','c3']

df1.loc[[1,2],['a1','b1','c1']] = np.nan
print (df1)
    a1   b1   c1  a2  b2  c2  a3  b3  c3
0  5.0  8.0  9.0   5   0   0   1   7   6
1  NaN  NaN  NaN   5   2   4   2   4   7
2  NaN  NaN  NaN   7   0   6   9   9   7

#stack dataframe and extract string and numbers from column e
df = df1.stack().reset_index()
df.columns= ['d','e','f']
df[['g','h']] = df.e.str.extract(r'([a-zA-Z]+)([0-9]+)', expand=True)

#append 1 to d, because index starts from 1 and compare with h
df = df[df.d + 1 == df.h.astype(int)]
#remove columns h, e
df = df.drop(['h', 'e'], axis=1)
#reshaping
df = df.pivot(index='d', columns='g', values='f')
#remove index and columns names (pandas 0.18.0+)
df = df.rename_axis(None).rename_axis(None, axis=1)

print (df)
     a    b    c
0  5.0  8.0  9.0
1  5.0  2.0  4.0
2  9.0  9.0  7.0

EDIT: I try modify a bit your sample:

import pandas as pd
import numpy as np

df1 = pd.DataFrame([['Address Data1','City data1','State Data1', 'Zip Data1', 'Address Data2','City data2','State Data2', 'Zip Data2',np.nan,np.nan,np.nan,np.nan],
 [ np.nan,np.nan,np.nan,np.nan,'Address Data3','City data3','State Data3', 'Zip Data3',np.nan,np.nan,np.nan,np.nan],
              [ np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'Address Data4','City data4','State Data4', 'Zip Data4']],
             columns = ['Address1','City1','State1','Zip1','Address2','City2','State2','Zip2','Address3','City3','State3','Zip3'])

print (df1)
        Address1       City1       State1       Zip1       Address2  \
0  Address Data1  City data1  State Data1  Zip Data1  Address Data2   
1            NaN         NaN          NaN        NaN  Address Data3   
2            NaN         NaN          NaN        NaN            NaN   

        City2       State2       Zip2       Address3       City3       State3  \
0  City data2  State Data2  Zip Data2            NaN         NaN          NaN   
1  City data3  State Data3  Zip Data3            NaN         NaN          NaN   
2         NaN          NaN        NaN  Address Data4  City data4  State Data4   

        Zip3  
0        NaN  
1        NaN  
2  Zip Data4  
#stack dataframe and extract string and numbers from column e
df = df1.stack().reset_index()
df.columns= ['d','e','f']
df[['g','h']] = df.e.str.extract(r'([a-zA-Z]+)([0-9]+)', expand=True)

#append 1 to d, because index starts from 1 and compare with h
df = df[df.d + 1 == df.h.astype(int)]
#remove columns h, e
df = df.drop(['h', 'e'], axis=1)
#reshaping
df = df.pivot(index='d', columns='g', values='f')

df = df.rename_axis(None).rename_axis(None, axis=1)
print (df)
         Address        City        State        Zip
0  Address Data1  City data1  State Data1  Zip Data1
1  Address Data3  City data3  State Data3  Zip Data3
2  Address Data4  City data4  State Data4  Zip Data4