Federico Gentile Federico Gentile - 14 days ago 7
Python Question

How to refactor simple dataframe parsing code with Pandas

I am using Pandas to parse a dataframe that I have created:

# Initial DF
A B C
0 -1 qqq XXX
1 20 www CCC
2 30 eee VVV
3 -1 rrr BBB
4 50 ttt NNN
5 60 yyy MMM
6 70 uuu LLL
7 -1 iii KKK
8 -1 ooo JJJ


My goal is to analyze column A and apply the following conditions to the dataframe:


  1. Investigate every row

  2. determine if
    df['A'].iloc[index]=-1

  3. if true and
    index=0
    mark first row as to be removed

  4. if true and
    index=N
    mark last row as to be removed

  5. if
    0<index<N
    and
    df['A'].iloc[index]=-1
    and the previous or following row contain -1 (
    df['A'].iloc[index+]=-1
    or
    df['A'].iloc[index-1]=-1
    ), mark row as to be removed; else replace
    -1 with the average of the previous and following value



The final dataframe should look like this:

# Final DF
A B C
0 20 www CCC
1 30 eee VVV
2 40 rrr BBB
3 50 ttt NNN
4 60 yyy MMM
5 70 uuu LLL


I was able to achieve my goal by writing a simple code that applies the conditions mentioned above:

import pandas as pd

# create dataframe
data = {'A':[-1,20,30,-1,50,60,70,-1,-1],
'B':['qqq','www','eee','rrr','ttt','yyy','uuu','iii','ooo'],
'C':['XXX','CCC','VVV','BBB','NNN','MMM','LLL','KKK','JJJ']}
df = pd.DataFrame(data)

# If df['A'].iloc[index]==-1:
# - option 1: remove row if first or last row are equal to -1
# - option 2: remove row if previous or following row contains -1 (df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1)
# - option 3: replace df['A'].iloc[index] if: df['A'].iloc[index]==-1 and (df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1)
N = len(df.index) # number of rows
index_vect = [] # store indexes of rows to be deleated
for index in range(0,N):

# option 1
if index==0 and df['A'].iloc[index]==-1:
index_vect.append(index)
elif index>1 and index<N and df['A'].iloc[index]==-1:

# option 2
if df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1:
index_vect.append(index)

# option 3
else:
df['A'].iloc[index] = int((df['A'].iloc[index+1]+df['A'].iloc[index-1])/2)

# option 1
elif index==N and df['A'].iloc[index]==-1:
index_vect.append(index)

# remove rows to be deleated
df = df.drop(index_vect).reset_index(drop = True)


As you can see the code is pretty long and I would like to know if you can suggest a smarter and more efficient way to obtain the same result.
Furthermore I noticed my code return a warning message cause by the line
df['A'].iloc[index] = int((df['A'].iloc[index+1]+df['A'].iloc[index-1])/2)

Do you know how I could optimize such line of code?

Answer

Here's a solution:

import numpy as np

# Let's replace -1 by Not a Number (NaN)
df.ix[df.A==-1,'A'] = np.nan

# If df.A is NaN and either the previous or next is also NaN, we don't select it
# This takes care of the condition on the first and last row too
df = df[~(df.A.isnull() & (df.A.shift(1).isnull() | df.A.shift(-1).isnull()))]

# Use interpolate to fill with the average of previous and next
df.A = df.A.interpolate(method='linear', limit=1)

Here's the resulting df:

    A       B       C
1   20.0    www     CCC
2   30.0    eee     VVV
3   40.0    rrr     BBB
4   50.0    ttt     NNN
5   60.0    yyy     MMM
6   70.0    uuu     LLL

You can then reset the index if you want to.

Comments