Zphinx677 Zphinx677 - 6 months ago 18
Python Question

pandas combining dataframe

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

java = pickle.load(open('JavaSafe.p','rb')) ##import 2d array
python = pickle.load(open('PythonSafe.p','rb')) ##import 2d array

javaFrame = pd.DataFrame(java,columns=['Town','Java Jobs'])
pythonFrame = pd.DataFrame(python,columns=['Town','Python Jobs'])
javaFrame = javaFrame.sort_values(by='Java Jobs',ascending=False)
pythonFrame = pythonFrame.sort_values(by='Python Jobs',ascending=False)
print(javaFrame,"\n",pythonFrame)


This code comes out with the following:

Town Java Jobs
435 York,NY 3593
212 NewYork,NY 3585
584 Seattle,WA 2080
624 Chicago,IL 1920
301 Boston,MA 1571
...
79 Holland,MI 5
38 Manhattan,KS 5
497 Vernon,IL 5
30 Clayton,MO 5
90 Waukegan,IL 5

[653 rows x 2 columns]

Town Python Jobs
160 NewYork,NY 2949
11 York,NY 2938
349 Seattle,WA 1321
91 Chicago,IL 1312
167 Boston,MA 1117

383 Hanover,NH 5
209 Bulverde,TX 5
203 Salisbury,NC 5
67 Rockford,IL 5
256 Ventura,CA 5

[416 rows x 2 columns]


I want to make a new dataframe that uses the town names as an index and has a column for each java and python. However, some of the towns will only have results for one of the languages.

Answer
import pandas as pd

javaFrame = pd.DataFrame({'Java Jobs': [3593, 3585, 2080, 1920, 1571, 5, 5, 5, 5, 5],
     'Town': ['York,NY', 'NewYork,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Holland,MI', 'Manhattan,KS', 'Vernon,IL', 'Clayton,MO', 'Waukegan,IL']}, index=[435, 212, 584, 624, 301, 79, 38, 497, 30, 90])
pythonFrame = pd.DataFrame({'Python Jobs': [2949, 2938, 1321, 1312, 1117, 5, 5, 5, 5, 5],
     'Town': ['NewYork,NY', 'York,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Hanover,NH', 'Bulverde,TX', 'Salisbury,NC', 'Rockford,IL', 'Ventura,CA']}, index=[160, 11, 349, 91, 167, 383, 209, 203, 67, 256])

result = pd.merge(javaFrame, pythonFrame, how='outer').set_index('Town')
#               Java Jobs  Python Jobs
# Town                                
# York,NY          3593.0       2938.0
# NewYork,NY       3585.0       2949.0
# Seattle,WA       2080.0       1321.0
# Chicago,IL       1920.0       1312.0
# Boston,MA        1571.0       1117.0
# Holland,MI          5.0          NaN
# Manhattan,KS        5.0          NaN
# Vernon,IL           5.0          NaN
# Clayton,MO          5.0          NaN
# Waukegan,IL         5.0          NaN
# Hanover,NH          NaN          5.0
# Bulverde,TX         NaN          5.0
# Salisbury,NC        NaN          5.0
# Rockford,IL         NaN          5.0
# Ventura,CA          NaN          5.0

pd.merge will by default join two DataFrames on all columns shared in common. In this cae, javaFrame and pythonFrame share only the Town column in common. So by default pd.merge would join the two DataFrames on the Town column.

how='outer causes pd.merge to use the union of the keys from both frames. In other words it causes pd.merge to return rows whose data come from either javaFrame or pythonFrame even if only one DataFrame contains the Town. Missing data is fill with NaNs.