andrew - 11 months ago 129

Python Question

I am confused by the performance in Pandas when building a large dataframe chunk by chunk. In Numpy, we (almost) always see better performance by preallocating a large empty array and then filling in the values. As I understand it, this is due to Numpy grabbing all the memory it needs at once instead of having to reallocate memory with every

`append`

In Pandas, I seem to be getting better performance by using the

`df = df.append(temp)`

Here is an example with timing. The definition of the

`Timer`

`append`

`np.empty`

`append`

`import numpy as np`

from numpy.random import rand

import pandas as pd

from timer import Timer

# Some constants

num_dfs = 10 # Number of random dataframes to generate

n_rows = 2500

n_cols = 40

n_reps = 100 # Number of repetitions for timing

# Generate a list of num_dfs dataframes of random values

df_list = [pd.DataFrame(rand(n_rows*n_cols).reshape((n_rows, n_cols)), columns=np.arange(n_cols)) for i in np.arange(num_dfs)]

##

# Define two methods of growing a large dataframe

##

# Method 1 - append dataframes

def method1():

out_df1 = pd.DataFrame(columns=np.arange(4))

for df in df_list:

out_df1 = out_df1.append(df, ignore_index=True)

return out_df1

def method2():

# # Create an empty dataframe that is big enough to hold all the dataframes in df_list

out_df2 = pd.DataFrame(columns=np.arange(n_cols), index=np.arange(num_dfs*n_rows))

#EDIT_1: Set the dtypes of each column

for ix, col in enumerate(out_df2.columns):

out_df2[col] = out_df2[col].astype(df_list[0].dtypes[ix])

# Fill in the values

for ix, df in enumerate(df_list):

out_df2.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values

return out_df2

# EDIT_2:

# Method 3 - preallocate dataframe with np.empty data of appropriate type

def method3():

# Create fake data array

data = np.transpose(np.array([np.empty(n_rows*num_dfs, dtype=dt) for dt in df_list[0].dtypes]))

# Create placeholder dataframe

out_df3 = pd.DataFrame(data)

# Fill in the real values

for ix, df in enumerate(df_list):

out_df3.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values

return out_df3

##

# Time both methods

##

# Time Method 1

times_1 = np.empty(n_reps)

for i in np.arange(n_reps):

with Timer() as t:

df1 = method1()

times_1[i] = t.secs

print 'Total time for %d repetitions of Method 1: %f [sec]' % (n_reps, np.sum(times_1))

print 'Best time: %f' % (np.min(times_1))

print 'Mean time: %f' % (np.mean(times_1))

#>> Total time for 100 repetitions of Method 1: 2.928296 [sec]

#>> Best time: 0.028532

#>> Mean time: 0.029283

# Time Method 2

times_2 = np.empty(n_reps)

for i in np.arange(n_reps):

with Timer() as t:

df2 = method2()

times_2[i] = t.secs

print 'Total time for %d repetitions of Method 2: %f [sec]' % (n_reps, np.sum(times_2))

print 'Best time: %f' % (np.min(times_2))

print 'Mean time: %f' % (np.mean(times_2))

#>> Total time for 100 repetitions of Method 2: 32.143247 [sec]

#>> Best time: 0.315075

#>> Mean time: 0.321432

# Time Method 3

times_3 = np.empty(n_reps)

for i in np.arange(n_reps):

with Timer() as t:

df3 = method3()

times_3[i] = t.secs

print 'Total time for %d repetitions of Method 3: %f [sec]' % (n_reps, np.sum(times_3))

print 'Best time: %f' % (np.min(times_3))

print 'Mean time: %f' % (np.mean(times_3))

#>> Total time for 100 repetitions of Method 3: 6.577038 [sec]

#>> Best time: 0.063437

#>> Mean time: 0.065770

I use a nice

`Timer`

`# credit: http://www.huyng.com/posts/python-performance-analysis/`

import time

class Timer(object):

def __init__(self, verbose=False):

self.verbose = verbose

def __enter__(self):

self.start = time.clock()

return self

def __exit__(self, *args):

self.end = time.clock()

self.secs = self.end - self.start

self.msecs = self.secs * 1000 # millisecs

if self.verbose:

print 'elapsed time: %f ms' % self.msecs

If you are still following, I have two questions:

1) Why is the

`append`

`n_rows = 40`

2) What is the most efficient way to build a large dataframe out of chunks? (In my case, the chunks are all large csv files).

Thanks for your help!

EDIT_1:

In my real world project, the columns have different dtypes. So I cannot use the

`pd.DataFrame(.... dtype=some_type)`

I added some lines to

`method2()`

EDIT_2: Try preallocating a dataframe using placeholder array

`np.empty(... dtyp=some_type)`

Answer Source

Your benchmark is actually too small to show the real difference. Appending, copies EACH time, so you are actually doing copying a size N memory space N*(N-1) times. This is horribly inefficient as the size of your dataframe grows. This certainly might not matter in a very small frame. But if you have any real size this matters a lot. This is specifically noted in the docs here, though kind of a small warning.

```
In [97]: df = DataFrame(np.random.randn(100000,20))
In [98]: df['B'] = 'foo'
In [99]: df['C'] = pd.Timestamp('20130101')
In [103]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 22 columns):
0 100000 non-null float64
1 100000 non-null float64
2 100000 non-null float64
3 100000 non-null float64
4 100000 non-null float64
5 100000 non-null float64
6 100000 non-null float64
7 100000 non-null float64
8 100000 non-null float64
9 100000 non-null float64
10 100000 non-null float64
11 100000 non-null float64
12 100000 non-null float64
13 100000 non-null float64
14 100000 non-null float64
15 100000 non-null float64
16 100000 non-null float64
17 100000 non-null float64
18 100000 non-null float64
19 100000 non-null float64
B 100000 non-null object
C 100000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(20), object(1)
memory usage: 17.5+ MB
```

Appending

```
In [85]: def f1():
....: result = df
....: for i in range(9):
....: result = result.append(df)
....: return result
....:
```

Concat

```
In [86]: def f2():
....: result = []
....: for i in range(10):
....: result.append(df)
....: return pd.concat(result)
....:
In [100]: f1().equals(f2())
Out[100]: True
In [101]: %timeit f1()
1 loops, best of 3: 1.66 s per loop
In [102]: %timeit f2()
1 loops, best of 3: 220 ms per loop
```

Note that I wouldn't even bother trying to pre-allocate. Its somewhat complicated, especially since you are dealing with multiple dtypes (e.g. you *could* make a giant frame and simply `.loc`

and it would work). But `pd.concat`

is just dead simple, works reliably, and fast.

And timing of your sizes from above

```
In [104]: df = DataFrame(np.random.randn(2500,40))
In [105]: %timeit f1()
10 loops, best of 3: 33.1 ms per loop
In [106]: %timeit f2()
100 loops, best of 3: 4.23 ms per loop
```