afora377 - 1 month ago 6

Python Question

I have a very large dataframe which has 100 years of dates as column headers (i.e. ~36500 columns) and 100 years of dates as indices (ie ~36500 rows). I have a function which calculates a value for each of the elements of the dataframe which will need to be run 36500^2 times.

Ok, the problem is not the function which is quite fast, but rather assignment of values to the dataframe. It takes about 1 sec per 6 assignments even if I assign a constant this way. Obviously I'm pretty thick as you can tell:

`for i, row in df_mBase.iterrows():`

for idx, val in enumerate(row):

df_mBase.ix[i][idx] = 1

print(i)

Ordinarily in C/Java i would simply loop through a 36500x36500 double loop and access the preassigned memory directly via indexing which can be achieved in constant time with virtually no overhead. But this appears to be not an option in python?

What would be the fastest way to store this data in a dataframe? Pythonian or not, I'm after speed only - I dont care for elegance.

Answer Source

You should create the data structure either in native python or in numpy and pass the data to a the DataFrame constructor. If your function can be written using numpy's function/operation, then you can use the vectorized nature of numpy to avoid looping over all indices.

Here is an example with a made up function:

```
import numpy as np
import pandas as pd
import datetime as dt
import dateutil as du
dates = [dt.date(2017, 1, 1) - du.relativedelta.relativedelta(days=i) for i in range(36500)]
data = np.zeros((36500,36500), dtype=np.uint8)
def my_func(i, j):
return (sum(divmod(i,j)) - sum(divmod(j,i))) % 255
for i in range(1, 36500):
for j in range(1, 36500):
data[i,j] = my_func(i,j)
df = pd.DataFrame(data, columns=dates, index=dates)
df.head(5)
#returns:
2017-08-21 2017-08-20 2017-08-19 2017-08-18 2017-08-17 \
2017-08-21 0 0 0 0 0
2017-08-20 0 0 254 253 252
2017-08-19 0 1 0 0 0
2017-08-18 0 2 0 0 1
2017-08-17 0 3 0 254 0
... 1917-09-19 1917-09-18 1917-09-17 1917-09-16
2017-08-21 ... 0 0 0 0
2017-08-20 ... 225 224 223 222
2017-08-19 ... 114 113 113 112
2017-08-18 ... 77 76 77 76
2017-08-17 ... 60 59 58 57
```