nteiss - 5 months ago
Python Question

How to join two pandas DataFrames in a "horizontal union" with python3

I have two large .csv files that I'm handling with Python, pandas and numpy. Here is a sample from the more granular data set (A); the timestamps are in 15-minute intervals:

Timestamp,Real Energy Into the Load
2016-06-01T11:00:00, 2
2016-06-01T10:45:00, 1
2016-06-01T10:30:00, 5
2016-06-01T10:15:00, 3
2016-06-01T10:00:00, 3
2016-06-01T09:45:00, 6
2016-06-01T09:30:00, 2
...


And here is a sample from the less granular data set (B), with timestamps approximately an hour apart, although there is a lot of variance between timestamps:

TimeEDT, TemperatureF, Dew PointF
2016-06-01T10:33:00,82.0,66.0
2016-06-01T09:34:00,79.0,64.9
2016-06-01T09:20:00,75.9,64.9
...


I want to combine them such that the combined data frame has the same number of rows as data frame B, by averaging groups of data frame A's rows. The rows corresponding to the samples above would be:

TimeEDT, TemperatureF, Dew PointF, Real Energy Into The Load
2016-06-01T10:33:00,82.0,66.0, 1.5 # average of (1, 2)
2016-06-01T09:34:00,79.0,64.9, 4.25 # average of (6, 3, 3, 5)
2016-06-01T09:20:00,75.9,64.9, 2 # average of (2,)
...


I think this is called a horizontal union in SQL.

Things I've already tried:
I took data set B (dfB) and used dfB['TimeEDT'].apply to "floor" each date to its 15-minute interval. From there, I can use the groupby function to sum the rows together, to at least get a one-to-one correspondence between the rows, but I still need to join the data frames horizontally. I would like a more direct way of doing this; ideally the argument to groupby could be some user-defined comparison.
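The "flooring" step described above doesn't need apply with a hand-written function; pandas' Series.dt.floor accessor does it directly. A minimal sketch on the sample from data set B (the "Slot" column name is my own):

```python
import pandas as pd

# Reconstruction of the sample from data set B in the question.
dfB = pd.DataFrame({
    "TimeEDT": pd.to_datetime([
        "2016-06-01T10:33:00", "2016-06-01T09:34:00", "2016-06-01T09:20:00",
    ]),
    "TemperatureF": [82.0, 79.0, 75.9],
    "Dew PointF": [66.0, 64.9, 64.9],
})

# Floor each timestamp to its 15-minute boundary:
# 10:33 -> 10:30, 09:34 -> 09:30, 09:20 -> 09:15
dfB["Slot"] = dfB["TimeEDT"].dt.floor("15min")
```

The resulting "Slot" column can then serve as a groupby/merge key against data set A's 15-minute timestamps.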

Answer

Maybe you can do something like the below. I haven't checked whether it works when there are no values during an hour, but this is the idea. (Note that grouping on index.hour alone assumes all the data falls within a single day.)

In[1]: import pandas as pd

In[2]: import numpy as np

In[3]: df1 = pd.DataFrame({"TemperatureF": np.random.randint(60, 91, 20), "DewPointF": np.random.randint(60, 91, 20)}, index = pd.date_range("2016-06-01 09:00:00", periods=20, freq="15min"))

In[4]: df2 = pd.DataFrame({"TemperatureF": np.random.randint(60, 91, 5), "DewPointF": np.random.randint(60, 91, 5), "RealEnergy": np.random.uniform(1.0, 5.0, 5)}, index = pd.date_range("2016-06-01 09:30:00", periods=5, freq="H"))

In[5]: df1
Out[5]: 
                     DewPointF  TemperatureF
2016-06-01 09:00:00         66            71
2016-06-01 09:15:00         84            68
2016-06-01 09:30:00         68            74
2016-06-01 09:45:00         66            85
2016-06-01 10:00:00         70            72
2016-06-01 10:15:00         63            78
2016-06-01 10:30:00         82            83
2016-06-01 10:45:00         67            79
2016-06-01 11:00:00         63            76
2016-06-01 11:15:00         72            80
2016-06-01 11:30:00         82            61
2016-06-01 11:45:00         60            81
2016-06-01 12:00:00         77            76
2016-06-01 12:15:00         78            60
2016-06-01 12:30:00         75            60
2016-06-01 12:45:00         83            67
2016-06-01 13:00:00         84            81
2016-06-01 13:15:00         66            66
2016-06-01 13:30:00         80            84
2016-06-01 13:45:00         87            69

In[6]: df2
Out[6]: 
                     DewPointF  RealEnergy  TemperatureF
2016-06-01 09:30:00         84    2.479343            88
2016-06-01 10:30:00         64    1.428840            67
2016-06-01 11:30:00         88    3.214666            83
2016-06-01 12:30:00         72    4.280898            71
2016-06-01 13:30:00         62    3.376502            78


In[7]: df2.merge(df1.groupby(df1.index.hour)[["DewPointF", "TemperatureF"]].mean(), on=df2.index.hour)
Out[7]: 
   key_0  DewPointF_x  RealEnergy  TemperatureF_x  DewPointF_y  TemperatureF_y
0      9           84    2.479343              88        71.00           74.50
1     10           64    1.428840              67        70.50           78.00
2     11           88    3.214666              83        69.25           74.50
3     12           72    4.280898              71        78.25           65.75
4     13           62    3.376502              78        79.25           75.00
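The question's exact interval semantics (each row of B absorbs the A rows up to the next B timestamp) can also be reproduced with pd.merge_asof, which is not used in the answer above. A minimal sketch on the question's sample data, where dfA and dfB are reconstructions of the two samples:

```python
import pandas as pd

# Reconstruction of the sample from data set A.
dfA = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2016-06-01T11:00:00", "2016-06-01T10:45:00", "2016-06-01T10:30:00",
        "2016-06-01T10:15:00", "2016-06-01T10:00:00", "2016-06-01T09:45:00",
        "2016-06-01T09:30:00",
    ]),
    "Real Energy Into the Load": [2, 1, 5, 3, 3, 6, 2],
}).sort_values("Timestamp")  # merge_asof requires sorted keys

# Reconstruction of the sample from data set B.
dfB = pd.DataFrame({
    "TimeEDT": pd.to_datetime([
        "2016-06-01T10:33:00", "2016-06-01T09:34:00", "2016-06-01T09:20:00",
    ]),
    "TemperatureF": [82.0, 79.0, 75.9],
    "Dew PointF": [66.0, 64.9, 64.9],
}).sort_values("TimeEDT")

# For each A row, find the latest B timestamp at or before it...
matched = pd.merge_asof(
    dfA, dfB[["TimeEDT"]],
    left_on="Timestamp", right_on="TimeEDT",
    direction="backward",
)
# ...then average A's readings within each B interval and attach them to B.
means = matched.groupby("TimeEDT")["Real Energy Into the Load"].mean()
result = dfB.merge(means, left_on="TimeEDT", right_index=True)
```

This yields one row per B timestamp with the averages from the question: 1.5 for 10:33, 4.25 for 09:34, and 2.0 for 09:20.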