afora377 afora377 - 1 month ago 6
Python Question

fast way of populating a very large dataframe with values

I have a very large dataframe which has 100 years of dates as column headers (i.e. ~36500 columns) and 100 years of dates as indices (ie ~36500 rows). I have a function which calculates a value for each of the elements of the dataframe which will need to be run 36500^2 times.

Ok, the problem is not the function which is quite fast, but rather assignment of values to the dataframe. It takes about 1 sec per 6 assignments even if I assign a constant this way. Obviously I'm pretty thick as you can tell:

for i, row in df_mBase.iterrows():
for idx, val in enumerate(row):
df_mBase.ix[i][idx] = 1
print(i)


Ordinarily in C/Java i would simply loop through a 36500x36500 double loop and access the preassigned memory directly via indexing which can be achieved in constant time with virtually no overhead. But this appears to be not an option in python?

What would be the fastest way to store this data in a dataframe? Pythonian or not, I'm after speed only - I dont care for elegance.

Answer Source

You should create the data structure either in native python or in numpy and pass the data to a the DataFrame constructor. If your function can be written using numpy's function/operation, then you can use the vectorized nature of numpy to avoid looping over all indices.

Here is an example with a made up function:

import numpy as np
import pandas as pd
import datetime as dt
import dateutil as du

dates = [dt.date(2017, 1, 1) - du.relativedelta.relativedelta(days=i) for i in range(36500)]
data = np.zeros((36500,36500), dtype=np.uint8)

def my_func(i, j):
    return (sum(divmod(i,j)) - sum(divmod(j,i))) % 255

for i in range(1, 36500):
    for j in range(1, 36500):
        data[i,j] = my_func(i,j)

df = pd.DataFrame(data, columns=dates, index=dates)

df.head(5)
#returns:

            2017-08-21  2017-08-20  2017-08-19  2017-08-18  2017-08-17  \
2017-08-21           0           0           0           0           0
2017-08-20           0           0         254         253         252
2017-08-19           0           1           0           0           0
2017-08-18           0           2           0           0           1
2017-08-17           0           3           0         254           0

               ...      1917-09-19  1917-09-18  1917-09-17  1917-09-16
2017-08-21     ...               0           0           0           0
2017-08-20     ...             225         224         223         222
2017-08-19     ...             114         113         113         112
2017-08-18     ...              77          76          77          76
2017-08-17     ...              60          59          58          57