blokeley blokeley - 1 year ago 100
Python Question

Vectorize integration of pandas.DataFrame

I have a

of force-displacement data. The displacement array has been set to the
index, and the columns are my various force curves for different tests.

How do I calculate the work done (which is "the area under the curve")?

I looked at
which seems to do what I need, but I think that I can avoid looping over each column like this:

import numpy as np
import pandas as pd

forces = pd.read_csv(...)
work_done = {}

for col in forces.columns:
work_done[col] = np.trapz(forces.loc[col], forces.index))

I was hoping to create a new
of the areas under the curves rather than a
, and thought that
or something might be appropriate but don't know where to start looking.

In short:

  1. Can I avoid the looping?

  2. Can I create a
    of work done directly?

Thanks in advance for any help.

Answer Source

You could vectorize this by passing the whole DataFrame to np.trapz and specifying the axis= argument, e.g.:

import numpy as np
import pandas as pd

# some random input data
gen = np.random.RandomState(0)
x = gen.randn(100, 10)
names = [chr(97 + i) for i in range(10)]
forces = pd.DataFrame(x, columns=names)

# vectorized version
wrk = np.trapz(forces, x=forces.index, axis=0)
work_done = pd.DataFrame(wrk[None, :], columns=forces.columns)

# non-vectorized version for comparison
work_done2 = {}
for col in forces.columns:
    work_done2.update({col:np.trapz(forces.loc[:, col], forces.index)})

These give the following output:

from pprint import pprint

#            0
# a -24.331560
# b -10.347663
# c   4.662212
# d -12.536040
# e -10.276861
# f   3.406740
# g  -3.712674
# h  -9.508454
# i  -1.044931
# j  15.165782

# {'a': -24.331559643023006,
#  'b': -10.347663159421426,
#  'c': 4.6622123535050459,
#  'd': -12.536039649161403,
#  'e': -10.276861220217308,
#  'f': 3.4067399176289994,
#  'g': -3.7126739591045541,
#  'h': -9.5084536839888187,
#  'i': -1.0449311137294459,
#  'j': 15.165781517623724}

There are a couple of other problems with your original example. col is a column name rather than a row index, so it needs to index the second dimension of your dataframe (i.e. .loc[:, col] rather than .loc[col]). Also, you have an extra trailing parenthesis on the last line.


You could also generate the output DataFrame directly by .applying np.trapz to each column, e.g.:

work_done = forces.apply(np.trapz, axis=0, args=(forces.index,))

However, this isn't really 'proper' vectorization - you are still calling np.trapz separately on each column. You can see this by comparing the speed of the .apply version against calling np.trapz directly:

In [1]: %timeit forces.apply(np.trapz, axis=0, args=(forces.index,))
1000 loops, best of 3: 582 µs per loop

In [2]: %timeit np.trapz(forces, x=forces.index, axis=0)
The slowest run took 6.04 times longer than the fastest. This could mean that an
intermediate result is being cached 
10000 loops, best of 3: 53.4 µs per loop

This isn't an entirely fair comparison, since the second version excludes the extra time taken to construct the DataFrame from the output numpy array, but this should still be smaller than the difference in time taken to perform the actual integration.