sshr sshr - 6 months ago 49
Python Question

Optimizing for loop in python

I have a dataframe (df) with distance traveled and I have assigned a label based on certain conditions.

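import pandas as pd
df = pd.DataFrame(distance, columns=["distance"])
for i in range(len(df["distance"])):
    if df["distance"].values[i] <= 0.10:
        df.loc[i, "label"] = 1  # labels 1/2/3 as used in the answer below
    elif df["distance"].values[i] <= 0.50:
        df.loc[i, "label"] = 2
    elif df["distance"].values[i] > 0.50:
        df.loc[i, "label"] = 3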

This works fine. However, I have more than 1 million distance records, and the for loop takes longer than expected. Can this code be optimized to reduce the execution time?


In general, you shouldn't loop over DataFrames unless it's absolutely necessary. You'll usually get much better performance using a built-in Pandas function that's already been optimized, or by using a vectorized approach.

In this case, you can use loc and Boolean indexing to do the assignments:

# Initialize as 1 (eliminate need to check the first condition).
df['label'] = 1

# Case 1: Between 0.1 and 0.5
df.loc[(df['distance'] > 0.1) & (df['distance'] <= 0.5), 'label'] = 2

# Case 2: Greater than 0.5
df.loc[df['distance'] > 0.5, 'label'] = 3
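
As a quick sanity check, here is a minimal sketch that applies the same loc assignments to a small hand-made frame. The sample distances are the ones from the output shown at the end of this answer, and the helper name label_with_loc is hypothetical, used only for illustration.

```python
import pandas as pd


def label_with_loc(df):
    """Assign labels 1/2/3 with Boolean indexing (hypothetical helper)."""
    out = df.copy()
    # Start everything at 1, so the first condition (<= 0.1) needs no check.
    out['label'] = 1
    # Between 0.1 (exclusive) and 0.5 (inclusive).
    out.loc[(out['distance'] > 0.1) & (out['distance'] <= 0.5), 'label'] = 2
    # Greater than 0.5.
    out.loc[out['distance'] > 0.5, 'label'] = 3
    return out


sample = pd.DataFrame({'distance': [0.0, 0.0001, 0.2, 1.23, 4.0]})
labeled = label_with_loc(sample)
print(labeled)
```

Each loc assignment touches the whole column in one vectorized operation, so the cost no longer grows with a Python-level iteration per row.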

Another option is to use pd.cut. This method is a little more specialized to the example problem in the question, whereas Boolean indexing is more general.

# Get the low and high bins.
low, high = df['distance'].min()-1, df['distance'].max()+1

# Perform the cut.  Add one since the labels start at zero by default.
df['label'] = pd.cut(df['distance'], bins=[low, 0.1, 0.5, high], labels=False) + 1

You could also pass labels=[1,2,3] in the code above and skip adding 1 to the result. This would give df['label'] a categorical dtype instead of an integer dtype, though. Depending on your use case, this may or may not be important.
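
To make the dtype difference concrete, here is a small sketch comparing the two pd.cut variants on the sample distances from the output table. The variable names (sample, as_int, as_cat, as_cat_int) are assumptions made for this illustration. It relies on pd.cut's default right=True, which makes each bin include its right edge and so matches the <= comparisons in the question.

```python
import pandas as pd

sample = pd.DataFrame({'distance': [0.0, 0.0001, 0.2, 1.23, 4.0]})
low, high = sample['distance'].min() - 1, sample['distance'].max() + 1
bins = [low, 0.1, 0.5, high]

# labels=False returns integer bin codes starting at 0, hence the + 1.
as_int = pd.cut(sample['distance'], bins=bins, labels=False) + 1

# Explicit labels return a categorical Series instead.
as_cat = pd.cut(sample['distance'], bins=bins, labels=[1, 2, 3])

print(as_int.dtype)
print(as_cat.dtype)  # category

# If plain integers are needed later, the categorical can be converted
# (this works here because no value falls outside the bins).
as_cat_int = as_cat.astype(int)
```

Both variants produce the same label values; only the dtype of the resulting column differs.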

The resulting output for either method:

   distance  label
0    0.0000      1
1    0.0001      1
2    0.2000      2
3    1.2300      3
4    4.0000      3