Michael Perdue Michael Perdue - 7 months ago 9
Python Question

Pandas: dataframes won't merge

I have two data frames (that can be found here and here) below:

df= pd.read_csv('Thesis/ExternalData/naics_conversion_data/SIC2CRPCats.csv', \
engine='python', sep=r'\s{2,}', encoding='utf-8_sig')


I have only provided the code for reading in
df
because it has some unique formatting issues.

df.dtypes

SICcode object
Catcode object
Category object
SICname object
MultSIC object
dtype: object

merged.dtypes

2012 NAICS Code float64
2002to2007 NAICS float64
SICcode object
dtype: object

df.columns.tolist()
['SICcode', 'Catcode', 'Category', 'SICname', 'MultSIC']

merged.columns.tolist()
['2012 NAICS Code', '2002to2007 NAICS', 'SICcode']

df.head(3)

SICcode Catcode Category SICname MultSIC
0 111 A1500 Wheat, corn, soybeans and cash grain Wheat X
1 112 A1600 Other commodities (incl rice, peanuts) Rice X
2 115 A1500 Wheat, corn, soybeans and cash grain Corn X

merged.sort_values('SICcode')

2012 NAICS Code 2002to2007 NAICS SICcode
89 212210 212210 1011
93 212234 212234 1021
92 212231 212231 1031
90 212221 212221 1041
91 212222 212222 1044
96 212299 212299 1061
94 212234 212234 1061
119 213114 213114 1081
1770 541360 541360 1081
233 238910 238910 1081
95 212291 212291 1094
97 212299 212299 1099
3 111140 111140 111
6 111160 111160 112
4 111150 111150 115
0 111110 111110 116


I am trying to merge them together with this code:
merged=pd.merge(merged,df, how='right', on='SICcode')


Which results in this:

2012 NAICS Code 0
2002to2007 NAICS 0
SICcode 1007
Catcode 991
Category 1007
SICname 1007
MultSIC 906
dtype: int64


I suspect that the issues lies with the formatting of
df
but I do not know how to describe (I have heard the term
white space
, maybe that is relevant for this scenario) or remedy the issue. Does anyone have an idea on this?

Answer

I believe this is the cause of your problem:

In [47]: merged[merged.SICcode == 'Aux']
Out[47]:
      2012 NAICS Code  2002to2007 NAICS SICcode
1828         551114.0          551114.0     Aux

resulting in different data types:

In [61]: df.dtypes
Out[61]:
SICcode      int64
Catcode     object
Category    object
SICname     object
MultSIC     object
dtype: object

In [62]: merged.dtypes
Out[62]:
2012 NAICS Code     float64
2002to2007 NAICS    float64
SICcode              object
dtype: object

In [63]: df.SICcode.unique()
Out[63]: array([ 111,  112,  115, ..., 9711, 9721, 9999], dtype=int64)

In [64]: merged.SICcode.head(10).unique()
Out[64]: array(['116', '119', '111', '115', '112', '139'], dtype=object)

So you can do it this way:

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/SIC2CRPCats.csv'
df = pd.read_csv(url, engine='python', sep=r'\s{2,}', encoding='utf-8_sig')

url='https://raw.githubusercontent.com/108michael/ms_thesis/master/test.merge'
merged = pd.read_csv(url, index_col=0)

# clearing data
merged.SICcode = pd.to_numeric(merged.SICcode, errors='coerce')

mrg = df.merge(merged, on='SICcode', how='left')

mrg.head()

Output:

In [51]: mrg.head()
Out[51]:
   SICcode Catcode                                       Category  \
0      111   A1500           Wheat, corn, soybeans and cash grain
1      112   A1600  Other commodities (incl rice, peanuts, honey)
2      115   A1500           Wheat, corn, soybeans and cash grain
3      116   A1500           Wheat, corn, soybeans and cash grain
4      119   A1500           Wheat, corn, soybeans and cash grain

            SICname MultSIC  2012 NAICS Code  2002to2007 NAICS
0             Wheat       X         111140.0          111140.0
1              Rice       X         111160.0          111160.0
2              Corn       X         111150.0          111150.0
3          Soybeans       X         111110.0          111110.0
4  Cash grains, NEC       X         111120.0          111120.0
Comments