I have a dataframe (df) with distance traveled and I have assigned a label based on certain conditions.
df = pd.DataFrame(distance,columns=["distance"])
for i in range(0, len(df['distance'])):
In general, you shouldn't loop over DataFrames unless it's absolutely necessary. You'll usually get much better performance using a built-in Pandas function that's already been optimized, or by using a vectorized approach.
In this case, you can use loc and Boolean indexing to do the assignments:
# Initialize as 1 (eliminates the need to check the first condition).
df['label'] = 1
# Case 1: Between 0.1 and 0.5
df.loc[(df['distance'] > 0.1) & (df['distance'] <= 0.5), 'label'] = 2
# Case 2: Greater than 0.5
df.loc[df['distance'] > 0.5, 'label'] = 3
Another option is to use pd.cut. This method is a little more specialized to the example problem in the question; Boolean indexing is a more general method.
# Get the low and high bins.
low, high = df['distance'].min() - 1, df['distance'].max() + 1
# Perform the cut. Add one since the labels start at zero by default.
df['label'] = pd.cut(df['distance'], bins=[low, 0.1, 0.5, high], labels=False) + 1
You could also pass labels=[1, 2, 3] in the code above and skip adding 1 to the result. This would give df['label'] a categorical dtype instead of an integer dtype, though. Depending on your use case, this may or may not be important.
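To make the dtype difference concrete, here is a minimal sketch. It assumes the five sample distances that appear in the output of this answer; the variable names (bins, int_labels, cat_labels, cat_as_int) are only illustrative.

```python
import pandas as pd

# Sample values assumed from the output shown in this answer.
df = pd.DataFrame({"distance": [0.0, 0.0001, 0.2, 1.23, 4.0]})

low, high = df["distance"].min() - 1, df["distance"].max() + 1
bins = [low, 0.1, 0.5, high]

# labels=False returns integer bin codes starting at 0, so add 1.
int_labels = pd.cut(df["distance"], bins=bins, labels=False) + 1

# Explicit labels return a categorical Series instead.
cat_labels = pd.cut(df["distance"], bins=bins, labels=[1, 2, 3])

# Cast back to integers if later code needs integer arithmetic.
cat_as_int = cat_labels.astype(int)

print(int_labels.dtype, cat_labels.dtype, cat_as_int.dtype)
```

If the labels are only used for grouping or filtering, the categorical version is usually fine as it is; the cast matters mainly when the labels feed into numeric operations.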
The resulting output for either method:
   distance  label
0    0.0000      1
1    0.0001      1
2    0.2000      2
3    1.2300      3
4    4.0000      3
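As a quick sanity check, the two approaches can be wrapped in small functions and compared, including at the boundary values 0.1 and 0.5. This is only a sketch: the helper names label_with_loc and label_with_cut are hypothetical, and the sample data is the set of distances above plus the two boundary values.

```python
import pandas as pd


def label_with_loc(distance):
    """Label distances with Boolean indexing (hypothetical helper)."""
    labels = pd.Series(1, index=distance.index)
    labels.loc[(distance > 0.1) & (distance <= 0.5)] = 2
    labels.loc[distance > 0.5] = 3
    return labels


def label_with_cut(distance):
    """Label distances with pd.cut (hypothetical helper)."""
    low, high = distance.min() - 1, distance.max() + 1
    # Bins are right-inclusive by default: (low, 0.1], (0.1, 0.5], (0.5, high].
    return pd.cut(distance, bins=[low, 0.1, 0.5, high], labels=False) + 1


# Sample distances plus the boundary values 0.1 and 0.5.
distance = pd.Series([0.0, 0.0001, 0.1, 0.2, 0.5, 1.23, 4.0])

by_loc = label_with_loc(distance)
by_cut = label_with_cut(distance)

# Both methods should agree on every row.
print((by_loc == by_cut).all())
```

The agreement at the boundaries relies on pd.cut's default right=True, which matches the <= comparisons used in the Boolean-indexing version.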