Thomas Johnson - 1 year ago 154

Python Question

What's the most efficient way to drop only consecutive duplicates in pandas?

drop_duplicates gives this:

`In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])`

`In [4]: a.drop_duplicates()`

Out[4]:

1 1

2 2

4 3

dtype: int64

But I want this:

`In [4]: a.something()`

Out[4]:

1 1

2 2

4 3

5 2

dtype: int64

Answer Source

Use `shift`:

```
a.loc[a.shift(-1) != a]
Out[3]:
1 1
3 2
4 3
5 2
dtype: int64
```

The above uses a boolean mask: we compare the Series against a copy of itself shifted by -1 rows, and keep only the rows where the two values differ.
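To make the mask easier to follow, here is a small sketch that builds it step by step on the Series from the question. The variable names `shifted` and `mask` are only for illustration and are not part of the original answer.

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# shift(-1) moves every value up one row, so each position now holds
# the *next* value; the last position has no successor and becomes NaN.
shifted = a.shift(-1)

# True wherever the next value differs from the current one.
# NaN never compares equal to anything, so the last row is always True.
mask = shifted != a

print(mask.tolist())               # [True, False, True, True, True]
print(a.loc[mask].index.tolist())  # [1, 3, 4, 5]
```

Because each row is compared with the one after it, this keeps the last element of every run of equal values.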

Another method is to use `diff`:

```
In [82]:
a.loc[a.diff() != 0]
Out[82]:
1 1
2 2
4 3
5 2
dtype: int64
```

But this is slower than the `shift` method if you have a large number of rows.
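If you want to check the speed difference on your own machine, a rough timing sketch could look like the following. The Series size, the random data, and the names `big`, `t_shift`, and `t_diff` are assumptions chosen for illustration, and the shift-based mask here uses the default period of 1 so that both expressions select the same rows. Actual timings will vary with the pandas version and hardware.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 200,000 small integers, so runs of repeats are common.
rng = np.random.default_rng(0)
big = pd.Series(rng.integers(0, 3, size=200_000))

# Both masks keep the first element of each run, so the results should match.
via_shift = big.loc[big.shift() != big]
via_diff = big.loc[big.diff() != 0]
print(via_shift.equals(via_diff))

# Time each approach over a few repetitions.
t_shift = timeit.timeit(lambda: big.loc[big.shift() != big], number=5)
t_diff = timeit.timeit(lambda: big.loc[big.diff() != 0], number=5)
print(f"shift: {t_shift:.4f}s  diff: {t_diff:.4f}s")
```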

**Update**

Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use `shift(1)`, or just `shift()` since the default is a period of 1. This returns the first value of each consecutive run:

```
In [87]:
a.loc[a.shift() != a]
Out[87]:
1 1
2 2
4 3
5 2
dtype: int64
```

Note the difference in index values: the repeated 2 is now kept at index 2 (its first occurrence) rather than index 3. Thanks @BjarkeEbert!
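As a minimal sketch, the two variants can be put side by side, with the `shift()` version wrapped in a small helper. The function name `drop_consecutive_duplicates` is hypothetical and not a pandas method.

```python
import pandas as pd


def drop_consecutive_duplicates(s: pd.Series) -> pd.Series:
    """Keep the first element of every run of equal consecutive values.

    Caveat: NaN never compares equal to NaN, so consecutive NaN values
    are all kept by this comparison.
    """
    return s.loc[s.shift() != s]


a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

first_of_run = drop_consecutive_duplicates(a)  # compares with the previous row
last_of_run = a.loc[a.shift(-1) != a]          # compares with the next row

print(first_of_run.index.tolist())  # [1, 2, 4, 5]
print(last_of_run.index.tolist())   # [1, 3, 4, 5]
```

The values are the same in both cases (1, 2, 3, 2); only the index label kept for the repeated run differs.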