Nick Braunagel - 1 year ago
Python Question

Faster For Loop to Manipulate Data in Pandas

I am working with pandas dataframes with shapes of roughly (100000, 50). Although I can achieve the desired data formatting and manipulations, my code takes longer than desired to run (3-10 minutes, depending on the specific task), including:

  1. Combining strings in different columns

  2. Applying a function to each instance within a data frame series

  3. Checking if a value is contained within a separate list or numpy array
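
For example, simplified versions of those three operations on a toy frame (illustrative only, not my real data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['x', 'y'], 'C': [1, 5]})

# 1. combining strings in different columns
df['AB'] = df['A'] + '_' + df['B']

# 2. applying a function to each instance within a series
df['C2'] = df['C'].map(lambda v: v * 2)

# 3. checking if each value is contained within a separate list/array
allowed = np.array([1, 2, 3])
df['ok'] = df['C'].isin(allowed)
```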

I will have larger dataframes in the future and want to ensure I'm using the appropriate coding methods to avoid very long processing times. I find my for loops take the longest. I try to avoid for loops by using list comprehensions and series operators (e.g. df.loc[:,'C'] = df.A + df.B), but in some cases I need to perform more complicated manipulations with nested for loops. For example, the code below iterates through a dataframe's history series (a series of lists), and then iterates through each item within each list:

for row in DF.iterrows():
    removelist = []
    for i in xrange(0, len(row[1]['history'])-1):
        if ((row[1]['history'][i]['title'] == row[1]['history'][i+1]['title']) &
                (row[1]['history'][i]['dept'] == row[1]['history'][i+1]['dept']) &
                (row[1]['history'][i]['office'] == row[1]['history'][i+1]['office']) &
                (row[1]['history'][i]['employment'] == row[1]['history'][i+1]['employment'])):
            removelist.append(i)
    newlist = [v for i, v in enumerate(row[1]['history']) if i not in removelist]

I know that list comprehensions can accommodate nested for loops, but the above would seem really cumbersome within a list comprehension.

My questions: what other techniques can I use to achieve the same functionality as a for loop with shorter processing time? And when iterating through a series containing lists, should I use a different technique than a nested for loop?

Answer

So what you seem to have here is a dataframe where the history entry of each row contains a list of dictionaries? Like:

import pandas as pd
john_history = [{'title': 'a', 'dept': 'cs'}, {'title': 'cj', 'dept': 'sales'}]
jill_history = [{'title': 'boss', 'dept': 'cs'}, {'title': 'boss', 'dept': 'cs'}, {'title': 'junior', 'dept': 'cs'}]
df = pd.DataFrame({'history': [john_history, jill_history], 
    'firstname': ['john', 'jill']})

I would restructure your data so that you use pandas structures at the bottom level of your structure, e.g. a dict of DataFrames where each DataFrame is the history (I don't think Panel works here as the DataFrames may have different lengths):

john_history = pd.DataFrame({'title': ['a', 'cj'], 'dept': ['cs', 'sales']})
john_history['name'] = 'john'
jill_history = pd.DataFrame({'title': ['boss', 'boss', 'junior'], 'dept': ['cs', 'cs', 'cs']})
jill_history['name'] = 'jill'
people = pd.concat([john_history, jill_history])

You can then process each person's history using groupby.
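
For example, a minimal sketch using the people frame built above (the specific aggregation is just an illustration):

```python
import pandas as pd

# rebuild the people frame from above so this runs standalone
john_history = pd.DataFrame({'title': ['a', 'cj'], 'dept': ['cs', 'sales']})
john_history['name'] = 'john'
jill_history = pd.DataFrame({'title': ['boss', 'boss', 'junior'], 'dept': ['cs', 'cs', 'cs']})
jill_history['name'] = 'jill'
people = pd.concat([john_history, jill_history])

# each group is one person's entire history; here, count distinct titles held
titles_held = people.groupby('name')['title'].nunique()
```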


In general, if you cannot find the functionality you want within pandas/numpy, you will usually find that building it from pandas primitives is faster than iterating over a dataframe. For example, to recreate your logic above, first create a new dataframe that is the original shifted by one row:

df2 = df.shift()

Now you can build a boolean mask by comparing the two dataframes, keeping only the rows that differ from their predecessor, and use it to filter the dataframe:

selection_array = (df.history == df2.history) & (df.title == df2.title)
unduplicated_consecutive = df[~selection_array]
#   history  id title
# 0       a   1     x
# 1       b   2     y

# or in one line:
df[~((df.history == df2.history) & (df.title == df2.title))]
# or equivalently:
df[(df.history != df2.history) | (df.title != df2.title)]
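
Put together as a runnable sketch (the small df below is hypothetical, chosen to match the output shown above):

```python
import pandas as pd

# hypothetical example frame matching the columns in the output above
df = pd.DataFrame({'history': ['a', 'b', 'b'],
                   'id': [1, 2, 3],
                   'title': ['x', 'y', 'y']})
df2 = df.shift()

# rows equal to their predecessor in both columns are marked True
selection_array = (df.history == df2.history) & (df.title == df2.title)
unduplicated_consecutive = df[~selection_array]
```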

So, putting this into a function you can use with groupby:

def drop_consecutive_duplicates(df):
    df2 = df.shift()
    return df.drop(df[(df.dept == df2.dept) & (df.title == df2.title)].index)
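
Applied to the people frame from earlier, usage would look like this (a sketch; group_keys=False just keeps the result flat):

```python
import pandas as pd

def drop_consecutive_duplicates(df):
    df2 = df.shift()
    return df.drop(df[(df.dept == df2.dept) & (df.title == df2.title)].index)

john_history = pd.DataFrame({'title': ['a', 'cj'], 'dept': ['cs', 'sales']})
john_history['name'] = 'john'
jill_history = pd.DataFrame({'title': ['boss', 'boss', 'junior'], 'dept': ['cs', 'cs', 'cs']})
jill_history['name'] = 'jill'
people = pd.concat([john_history, jill_history])

# one shift/compare per person, so duplicates are never matched across people
deduped = people.groupby('name', group_keys=False).apply(drop_consecutive_duplicates)
```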