Federico Gentile Federico Gentile - 1 year ago 104
Python Question

How to refactor simple dataframe parsing code with Pandas

I am using Pandas to parse a dataframe that I have created:

# Initial DF
0 -1 qqq XXX
1 20 www CCC
2 30 eee VVV
3 -1 rrr BBB
4 50 ttt NNN
5 60 yyy MMM
6 70 uuu LLL
7 -1 iii KKK
8 -1 ooo JJJ

My goal is to analyze column A and apply the following conditions to the dataframe:

  1. Investigate every row

  2. determine if

  3. if true and
    mark first row as to be removed

  4. if true and
    mark last row as to be removed

  5. if
    and the previous or following row contain -1 (
    ), mark row as to be removed; else replace
    -1 with the average of the previous and following value

The final dataframe should look like this:

# Final DF
0 20 www CCC
1 30 eee VVV
2 40 rrr BBB
3 50 ttt NNN
4 60 yyy MMM
5 70 uuu LLL

I was able to achieve my goal by writing a simple code that applies the conditions mentioned above:

import pandas as pd

# create dataframe
data = {'A':[-1,20,30,-1,50,60,70,-1,-1],
df = pd.DataFrame(data)

# If df['A'].iloc[index]==-1:
# - option 1: remove row if first or last row are equal to -1
# - option 2: remove row if previous or following row contains -1 (df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1)
# - option 3: replace df['A'].iloc[index] if: df['A'].iloc[index]==-1 and (df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1)
N = len(df.index) # number of rows
index_vect = [] # store indexes of rows to be deleated
for index in range(0,N):

# option 1
if index==0 and df['A'].iloc[index]==-1:
elif index>1 and index<N and df['A'].iloc[index]==-1:

# option 2
if df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1:

# option 3
df['A'].iloc[index] = int((df['A'].iloc[index+1]+df['A'].iloc[index-1])/2)

# option 1
elif index==N and df['A'].iloc[index]==-1:

# remove rows to be deleated
df = df.drop(index_vect).reset_index(drop = True)

As you can see the code is pretty long and I would like to know if you can suggest a smarter and more efficient way to obtain the same result.
Furthermore I noticed my code return a warning message cause by the line
df['A'].iloc[index] = int((df['A'].iloc[index+1]+df['A'].iloc[index-1])/2)

Do you know how I could optimize such line of code?

Answer Source

Here's a solution:

import numpy as np

# Let's replace -1 by Not a Number (NaN)
df.ix[df.A==-1,'A'] = np.nan

# If df.A is NaN and either the previous or next is also NaN, we don't select it
# This takes care of the condition on the first and last row too
df = df[~(df.A.isnull() & (df.A.shift(1).isnull() | df.A.shift(-1).isnull()))]

# Use interpolate to fill with the average of previous and next
df.A = df.A.interpolate(method='linear', limit=1)

Here's the resulting df:

    A       B       C
1   20.0    www     CCC
2   30.0    eee     VVV
3   40.0    rrr     BBB
4   50.0    ttt     NNN
5   60.0    yyy     MMM
6   70.0    uuu     LLL

You can then reset the index if you want to.