durbachit durbachit - 7 months ago 105
Python Question

Butterworth filter applied on a column of a pandas dataframe

I have a dataframe like this (just much bigger, with smaller step of x):

x val1 val2 val3
0 0.0 10.0 NaN NaN
1 0.5 10.5 NaN NaN
2 1.0 11.0 NaN NaN
3 1.5 11.5 NaN 11.60
4 2.0 12.0 NaN 12.08
5 2.5 12.5 12.2 12.56
6 3.0 13.0 19.8 13.04
7 3.5 13.5 13.3 13.52
8 4.0 14.0 19.8 14.00
9 4.5 14.5 14.4 14.48
10 5.0 NaN 19.8 14.96
11 5.5 15.5 15.5 15.44
12 6.0 16.0 19.8 15.92
13 6.5 16.5 16.6 16.40
14 7.0 17.0 19.8 18.00
15 7.5 17.5 17.7 NaN
16 8.0 18.0 19.8 NaN
17 8.5 18.5 18.8 NaN
18 9.0 19.0 19.8 NaN
19 9.5 19.5 19.9 NaN
20 10.0 20.0 19.8 NaN

My original issue was calculating derivatives for each of the columns and it was resolved in this question: How to get indexes of values in a Pandas DataFrame?
Combined with my previous code, the solution posted by Alexander looks like this:

import pandas as pd
import numpy as np

df = pd.read_csv('H:/DocumentsRedir/pokus/dataframe.csv', delimiter=',')

vals = list(df.columns.values)[1:]
dVal = df.iloc[:, 1:].diff() # `x` is in column 0.
dX = df['x'].diff()

dVal.apply(lambda series: series / dX)

However, I need to do some smoothing (let's say to 2 m here, from the original 0.5 m spacing of x), because the values of the derivatives just get crazy at the fine scale.
I have tried the scipy functions filtfilt and butter (I want to use the Butterworth filter, which is common practice in my discipline), but I am probably not using them correctly. UPDATE: I also tried savgol_filter.

How should I implement these functions in this code?

This is how I modified the code:

step = 0.5
relevant_scale = 2
order_butterworth = 4
b, a = butter(order_butterworth, step/relevant_scale, btype='low', analog=False)
smoothed=filtfilt(b,a,data.iloc[:, 1:]) # the first column is x
dVal = smoothed.diff()
dz = data['Depth'].diff()
derivative = (dVal.apply(lambda series: series / dz))*1000

But the resulting smoothed was an array of NaNs, and I got this error:
AttributeError: 'numpy.ndarray' object has no attribute 'diff'

This problem was solved by the answer at http://stackoverflow.com/a/38691551/5553319, and the code really works on continuous data. However, what happens after the barely noticeable change I made to the source data (a NaN value in the middle of val1)?

So how can we make this solution stable even in the case we miss a datapoint in an otherwise continuous array of data?
Ok, also answered in the comments. Such missing datapoints need to be interpolated.
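
For reference, here is a minimal sketch of the Butterworth route with the gap interpolated first. It uses synthetic data in place of the real file, and it assumes that "smoothing to 2 m" means a cutoff wavelength of 2 m. Under that assumption the normalized cutoff passed to butter is 2 * step / relevant_scale, because butter expects the cutoff as a fraction of the Nyquist frequency; the value step / relevant_scale used above would correspond to a 4 m cutoff instead.

```python
import numpy as np
import pandas as pd
from scipy.signal import butter, filtfilt

step = 0.5            # spacing of x
relevant_scale = 2.0  # assumed meaning: shortest wavelength to keep
order = 4

# Synthetic stand-in for one column: slope 1, some noise, one missing point.
rng = np.random.default_rng(0)
x = np.arange(0.0, 20.0, step)
col = pd.Series(10.0 + x + rng.normal(0.0, 0.2, x.size), index=x, name="val1")
col.iloc[10] = np.nan

# Fill interior gaps linearly against x; leading/trailing NaNs stay NaN and are dropped.
filled = col.interpolate(method="index", limit_area="inside").dropna()

# Cutoff frequency (1 / relevant_scale) divided by the Nyquist frequency (1 / (2 * step)).
wn = (1.0 / relevant_scale) / (1.0 / (2.0 * step))
b, a = butter(order, wn, btype="low", analog=False)

# filtfilt returns a NumPy array, so wrap it to keep the x index and regain .diff().
smoothed = pd.Series(filtfilt(b, a, filled.to_numpy()), index=filled.index, name=col.name)
derivative = smoothed.diff() / step
```

Note that filtfilt with its default padding needs more samples than three times the number of filter coefficients (more than 15 here), so very short columns would need a smaller padlen or a lower order.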


The error you are seeing occurs because you are calling the method .diff() on the result of filtfilt, which is a NumPy array and has no such method. If you really want a simple finite-difference derivative, you can use np.gradient(smoothed), which uses central differences at the interior points.
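
As a small illustration of the two ways around the AttributeError, here is a sketch that uses a noise-free line as a stand-in for the array returned by filtfilt (the variable names are only illustrative). Either stay in NumPy and pass the x coordinates to np.gradient, or wrap the array back into a pandas Series so that .diff() is available again.

```python
import numpy as np
import pandas as pd

x = np.arange(0.0, 5.0, 0.5)
smoothed = 10.0 + 2.0 * x  # stand-in for the ndarray returned by filtfilt

# Option 1: stay in NumPy; passing x makes the result dy/dx rather than dy per sample.
slope_gradient = np.gradient(smoothed, x)

# Option 2: wrap the array in a Series indexed by x and divide the differences.
s = pd.Series(smoothed, index=x)
slope_diff = s.diff() / s.index.to_series().diff()
```

The np.gradient result has the same length as the input, while the .diff() version starts with a NaN because the first point has no predecessor.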

Now, it appears that your real goal is to obtain a lag-free estimate of the derivative of a noisy signal. I would recommend using something like the Savitzky-Golay filter instead, which gives you the derivative estimate in a single application of the filter. You can see an example of derivative estimation on a noisy signal here
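
To give a feel for the difference, here is a rough sketch on a synthetic noisy sine wave (not your data; the window length and polynomial order are arbitrary choices for illustration). It compares a plain finite-difference derivative with the Savitzky-Golay estimate against the known derivative, cos(x).

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0 * np.pi, 200)
dx = x[1] - x[0]
y = np.sin(x) + rng.normal(0.0, 0.01, x.size)

# Plain central differences amplify the noise by roughly 1 / dx.
naive = np.gradient(y, dx)

# Savitzky-Golay: fit a local quadratic in a 21-point window and return its slope.
sg = savgol_filter(y, window_length=21, polyorder=2, deriv=1, delta=dx)

# Compare both with the true derivative, away from the edges.
inner = slice(20, -20)
err_naive = np.sqrt(np.mean((naive[inner] - np.cos(x[inner])) ** 2))
err_sg = np.sqrt(np.mean((sg[inner] - np.cos(x[inner])) ** 2))
```

A wider window smooths more but starts to bias the estimate where the signal curves, so the window should be chosen relative to the scale you care about.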

You will also need to accommodate the NaNs in your data. Here is how I would do it with your data:

import scipy.signal
import matplotlib.pyplot as plt

# df is the dataframe read from the CSV in the question.
# Intelligent use of the index allows us to keep track of the x for the data.
df = df.set_index('x')
# Sample spacing of x (assumes x is evenly spaced).
dx = df.index[1] - df.index[0]

for col in df:
    # Get rid of nans
    # NOTE: If you have nans in between your data points, this does the wrong thing,
    # but for the data you show for contiguous data this is fine.
    nonans = df[col].dropna()
    # deriv=1 returns the first derivative; delta scales it to units of x.
    smoothed = scipy.signal.savgol_filter(nonans, 5, 2, deriv=1, delta=dx)
    plt.plot(nonans.index, smoothed, label=col)

plt.legend()
plt.show()

This results in the following figure:

Sample plot
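
If a column has a NaN between valid points, as in the updated question, dropna() silently changes the spacing at that point. One possible workaround, following the interpolation suggestion from the comments, is sketched below. It rebuilds only the val1 column from the question (which is simply 10 + x with the value at x = 5.0 missing) and fills the gap linearly before filtering; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

# val1 from the question: 10 + x, with the point at x = 5.0 missing.
x = np.arange(0.0, 10.5, 0.5)
val1 = 10.0 + x
val1[10] = np.nan
df = pd.DataFrame({"x": x, "val1": val1}).set_index("x")
dx = df.index[1] - df.index[0]

col = df["val1"]
# Interpolate interior gaps against x; leading/trailing NaNs are left alone and dropped.
filled = col.interpolate(method="index", limit_area="inside").dropna()

# Same filter settings as above; wrap the result to keep the x index.
deriv = pd.Series(
    savgol_filter(filled.to_numpy(), 5, 2, deriv=1, delta=dx),
    index=filled.index,
    name=col.name,
)
```

Linear interpolation is only reasonable for short gaps; for long runs of missing data it may be safer to filter each contiguous segment separately.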