guilhermecgs guilhermecgs - 1 month ago 14
Python Question

Pandas "diff()" with string

How can I flag a row in a dataframe every time a column change its string value?

Ex:

Input

ColumnA ColumnB
1 Blue
2 Blue
3 Red
4 Red
5 Yellow


# diff won't work here with strings.... only works in numerical values
dataframe['changed'] = dataframe['ColumnB'].diff()


ColumnA ColumnB changed
1 Blue 0
2 Blue 0
3 Red 1
4 Red 0
5 Yellow 1

Answer

I get better performance with ne instead of using the actual != comparison:

df['changed'] = df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)

Timings

Using the following setup to produce a larger dataframe:

df = pd.concat([df]*10**5, ignore_index=True) 

I get the following timings:

%timeit df['ColumnB'].ne(df['ColumnB'].shift().bfill()).astype(int)
10 loops, best of 3: 38.1 ms per loop

%timeit (df.ColumnB != df.ColumnB.shift()).astype(int)
10 loops, best of 3: 77.7 ms per loop

%timeit df['ColumnB'] == df['ColumnB'].shift(1).fillna(df['ColumnB'])
10 loops, best of 3: 99.6 ms per loop

%timeit (df.ColumnB.ne(df.ColumnB.shift())).astype(int)
10 loops, best of 3: 19.3 ms per loop
Comments