Konstantino Sparakis - 12 days ago
Python Question

Using Python To Correlate multiple Time Series

I have a data frame that is grouped by two columns, "AvailabilityZone" and "InstanceType".

I create this using the following code:

#Create full time series and fill data
dfSorted = df.groupby(['AvailabilityZone', 'InstanceType'])
dfSorted = dfSorted.resample('H')
dfSorted = dfSorted.fillna("ffill")
dfSorted = dfSorted.dropna()


Every grouping represents a time series. I have already resampled the data so that all the time series run on an hourly basis. How do I run a correlation to find out how similar the time series are to one another?

When I use:

dfSorted.corr()


it just returns SpotPrice = 1, so I assume I will probably have to use something like a loop and compare each time series with the others? I'm lost, so any help is much appreciated!

Here is my dataframe as a CSV file:
https://www.dropbox.com/s/xgv8xm5n5o856mx/out.csv?dl=0

I simply used df.to_csv()

Answer

When you create groups, I am assuming you use groupby. You can first create your groups:

groups = df.groupby(['whatever','grouping'])

Then you can build a list of lists for the value you want to correlate, which I believe in your case is SpotPrice:

grouped_prices = [g['SpotPrice'].tolist() for i,g in groups]

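numpy.corrcoef takes a list of lists as input, calculates the correlation between each pair of lists, and returns a correlation coefficient matrix. The lists must all have the same length, and for the result to be meaningful their values should line up on the same timestamps. See: https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html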

import numpy
numpy.corrcoef(grouped_prices)

This is your correlation coefficient matrix.
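
As a rough, self-contained sketch of the steps above, the snippet below builds three made-up hourly series, groups them, and labels the resulting matrix with the group keys so it is easier to read. The zone names, instance types and prices are invented for illustration and are not taken from the asker's CSV. It also assumes every group has the same number of hourly rows.

```python
import numpy as np
import pandas as pd

# Made-up hourly prices for three (zone, type) groups; not the asker's data.
hours = pd.date_range("2017-01-01", periods=5, freq=pd.Timedelta(hours=1))
prices = {
    ("us-east-1a", "m3.large"): [1.0, 2.0, 3.0, 4.0, 5.0],
    ("us-east-1a", "c3.large"): [2.0, 4.0, 6.0, 8.0, 10.0],
    ("us-east-1b", "m3.large"): [5.0, 4.0, 3.0, 2.0, 1.0],
}
frames = [
    pd.DataFrame(
        {"AvailabilityZone": zone, "InstanceType": itype, "SpotPrice": vals},
        index=hours,
    )
    for (zone, itype), vals in prices.items()
]
df = pd.concat(frames)

groups = df.groupby(["AvailabilityZone", "InstanceType"])

# Keep the group keys so the matrix rows and columns can be labelled.
labels = [name for name, _ in groups]
grouped_prices = [g["SpotPrice"].tolist() for _, g in groups]

# Each inner list is treated as one variable (one time series).
matrix = np.corrcoef(grouped_prices)
corr = pd.DataFrame(matrix, index=labels, columns=labels)
print(corr.round(2))
```

The diagonal is always 1 because every series is perfectly correlated with itself. The off-diagonal entries are the pairwise comparisons you are after.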

numpy.corrcoef and pandas.DataFrame.corr should give you the same results, because by default they both calculate the Pearson correlation coefficient. I chose numpy.corrcoef because I think it's easier in this case.
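
If you would rather stay in pandas, one possible route is to reshape the data so that each group becomes its own column and then call corr() on that wide frame. The sketch below uses invented data and assumes the timestamps live in a column called "Timestamp" (a hypothetical name; in your frame they may be the index instead). It then checks that both libraries agree.

```python
import numpy as np
import pandas as pd

hours = pd.date_range("2017-01-01", periods=5, freq=pd.Timedelta(hours=1))

# Hypothetical long-format data: one row per (timestamp, zone, type).
df = pd.DataFrame({
    "Timestamp": list(hours) * 2,
    "AvailabilityZone": ["us-east-1a"] * 5 + ["us-east-1b"] * 5,
    "InstanceType": ["m3.large"] * 10,
    "SpotPrice": [1.0, 3.0, 2.0, 5.0, 4.0, 2.0, 2.5, 3.5, 3.0, 6.0],
})

# One column per group, with rows aligned on the timestamp.
wide = df.pivot_table(
    index="Timestamp",
    columns=["AvailabilityZone", "InstanceType"],
    values="SpotPrice",
)

pandas_corr = wide.corr()                     # Pearson by default
numpy_corr = np.corrcoef(wide.to_numpy().T)   # corrcoef expects one series per row

same = np.allclose(pandas_corr.to_numpy(), numpy_corr)
print(pandas_corr.round(3))
print(same)
```

The wide layout also explains the result in the question: a frame with a single SpotPrice column can only be correlated with itself, which gives 1.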

Also, before you use Pearson correlation, you should know that it only measures linear relationships between variables, and that your data must meet certain assumptions for it to be appropriate. See here for example.
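
To illustrate the point about linearity, here is a small sketch with invented numbers: y grows exponentially with x, so the two always move together, yet the Pearson coefficient is well below 1. For contrast, it also computes Pearson on the ranks of the values (this is the Spearman rank correlation, which is my addition and not something the question asked for), which only looks at ordering.

```python
import numpy as np
import pandas as pd

# Invented data: a perfectly monotonic but strongly nonlinear relationship.
x = pd.Series(np.arange(1, 11), dtype=float)
y = np.exp(x)

# Pearson measures how well a straight line describes the relationship.
pearson = x.corr(y)

# Pearson on the ranks (Spearman) only cares whether the ordering matches.
rank_based = x.rank().corr(y.rank())

print(round(pearson, 3), round(rank_based, 3))
```

So a low Pearson value between two of your series does not necessarily mean they are unrelated. It may only mean the relationship is not linear.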