Konstantino Sparakis - 1 year ago

Python Question

I have a data frame that is grouped by two columns, "AvailabilityZone" and "InstanceType", as seen below.

I create this using the following code:

```
# Create full time series and fill data
dfSorted = df.groupby(['AvailabilityZone', 'InstanceType'])
dfSorted = dfSorted.resample('H')
dfSorted = dfSorted.fillna("ffill")
dfSorted = dfSorted.dropna()
```

Every grouping represents a time series. I have already resampled the data so that all the time series run on an hourly basis. How do I run a correlation to find out how similar each time series is to the others?

When I use:

`dfSorted.corr()`

it just returns SpotPrice = 1. So I assume I will probably have to use something like a loop and compare each time series with the others? I'm lost; any help is much appreciated!

Here is my data frame as a CSV file:

https://www.dropbox.com/s/xgv8xm5n5o856mx/out.csv?dl=0

I simply used `df.to_csv()`.


Answer Source

When you create groups, I am assuming you use `groupby`. You can first create your groups:

```
groups = df.groupby(['whatever','grouping'])
```

Then you can get a list of lists for the value you want to correlate; I believe in your case this is `SpotPrice`. So:

```
grouped_prices = [g['SpotPrice'].tolist() for i,g in groups]
```

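`numpy.corrcoef` takes a list of lists as input, calculates the correlation between each pair of lists, and returns a correlation coefficient matrix. The lists must all have the same length, which holds if every group was resampled over the same hourly range. See: https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html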

```
import numpy
numpy.corrcoef(grouped_prices)
```

This is your correlation coefficient matrix.
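As a minimal, self-contained sketch of the steps above, the following builds a small synthetic frame and labels the rows and columns of the resulting matrix. The zone and instance-type labels and the prices are made up for illustration, not taken from the question's CSV, and the sketch assumes every group has the same number of observations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 48  # observations per group; np.corrcoef needs rows of equal length
base = rng.normal(size=n).cumsum()  # shared random-walk signal

# Hypothetical groups: each one is the shared signal plus its own noise.
frames = []
for zone, itype, noise in [("us-east-1a", "m3.large", 0.1),
                           ("us-east-1b", "m3.large", 0.5),
                           ("us-east-1a", "c4.xlarge", 2.0)]:
    frames.append(pd.DataFrame({
        "AvailabilityZone": zone,
        "InstanceType": itype,
        "SpotPrice": base + rng.normal(scale=noise, size=n),
    }))
df = pd.concat(frames, ignore_index=True)

groups = df.groupby(["AvailabilityZone", "InstanceType"])

# Keep the group keys so the matrix rows and columns can be labelled.
labels = [f"{zone}/{itype}" for (zone, itype), _ in groups]
grouped_prices = [g["SpotPrice"].tolist() for _, g in groups]

corr = np.corrcoef(grouped_prices)
corr_df = pd.DataFrame(corr, index=labels, columns=labels)
print(corr_df.round(2))
```

Entry `[i, j]` of the matrix is the Pearson correlation between group `i` and group `j`, so the matrix is symmetric and its diagonal is always 1.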

`numpy.corrcoef` and `pandas.DataFrame.corr` should give you the same results because by default they both calculate the Pearson correlation coefficient. I chose to use `numpy.corrcoef` because I think it's easier in this case.
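For comparison, here is a rough sketch of the pandas route on synthetic data. It assumes the data is in long format with a hypothetical `Timestamp` column (that column name and the group labels are assumptions, not from the question). The frame is pivoted so that each group becomes one column, because `DataFrame.corr` correlates columns whereas `numpy.corrcoef` correlates rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# 24 hourly timestamps (synthetic)
hours = pd.Timestamp("2017-01-01") + pd.to_timedelta(np.arange(24), unit="h")

# Long-format frame with hypothetical group labels and random prices.
frames = []
for zone, itype in [("us-east-1a", "m3.large"),
                    ("us-east-1b", "m3.large"),
                    ("us-east-1a", "c4.xlarge")]:
    frames.append(pd.DataFrame({
        "Timestamp": hours,
        "AvailabilityZone": zone,
        "InstanceType": itype,
        "SpotPrice": rng.normal(loc=0.1, scale=0.01, size=len(hours)),
    }))
long_df = pd.concat(frames, ignore_index=True)

# One column per (AvailabilityZone, InstanceType) pair, one row per hour.
wide = long_df.pivot_table(index="Timestamp",
                           columns=["AvailabilityZone", "InstanceType"],
                           values="SpotPrice")

pandas_corr = wide.corr()                    # Pearson by default, column vs column
numpy_corr = np.corrcoef(wide.to_numpy().T)  # transpose so each row is one series

print(pandas_corr.round(2))
print(np.allclose(pandas_corr.to_numpy(), numpy_corr))
```

One practical difference: `DataFrame.corr` skips missing values pairwise, while `numpy.corrcoef` propagates `NaN`, so the two only agree when the series have no gaps.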

Also, before you use Pearson correlation, you should know that it only measures linear relationships between variables, and that your data must meet certain assumptions for it to be appropriate. See here for example.
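To illustrate the linearity point, here is a small sketch on made-up data where `y` depends on `x` perfectly but not linearly. Spearman rank correlation is shown only as a point of comparison; it is not something the answer above recommends, and whether it suits spot-price data is an open question.

```python
import numpy as np
import pandas as pd

# y is a strictly increasing but strongly nonlinear function of x.
x = np.linspace(0, 5, 50)
df_demo = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = df_demo.corr(method="pearson").loc["x", "y"]
spearman = df_demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")   # below 1 although y is fully determined by x
print(f"Spearman: {spearman:.3f}")  # rank-based, so it sees the monotonic relation
```

A Pearson coefficient well below 1 therefore does not by itself mean two series are unrelated; it only means the relationship is not well described by a straight line.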
