Joe Joe - 3 months ago 8
Python Question

Conditional length of a binary data series in Pandas

I have a DataFrame with the following column:

df['A'] = [1,1,1,0,1,1,1,1,0,1]


What would be the best vectorized way to cap the length of each run of consecutive 1s at some limit? Say the limit is 2: every 1 beyond the second in a run should become 0, so the resulting column 'B' must look like:

   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1

Answer

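One fully vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to mark the rows that fall within the first 2 positions of each run of consecutive equal values (or whatever limit you like; the `<= 1` in the code below is the limit minus one). Then combine this boolean Series with the original column using &, so that only the 1s survive: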

df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
          .astype(int) # cast the boolean Series back to integers

This produces the new column in the DataFrame:

   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
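To see why the one-liner works, it can help to split it into its intermediate steps. The following is a minimal sketch, not a different method: the names `limit`, `run_start`, `run_id` and `pos_in_run` are introduced here purely for illustration, and the explicit `astype(bool)` is an added safeguard so the `&` is clearly a boolean operation.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2  # illustrative name: maximum number of 1s to keep per run

# Step 1: True wherever the value differs from the previous row,
# i.e. at the first row of every run.
run_start = df['A'] != df['A'].shift()

# Step 2: the cumulative sum of those flags gives each run its own id.
run_id = run_start.cumsum()

# Step 3: cumcount numbers the rows inside each run: 0, 1, 2, ...
pos_in_run = df.groupby(run_id).cumcount()

# Step 4: keep a 1 only if it is among the first `limit` rows of its run.
df['B'] = ((pos_in_run < limit) & df['A'].astype(bool)).astype(int)

# Show the intermediate columns next to A and B for inspection.
print(df.assign(run_id=run_id, pos_in_run=pos_in_run))
```

Writing the comparison as `pos_in_run < limit` rather than `<= 1` makes the limit a single adjustable value; with `limit = 2` the two forms are equivalent.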

[1] See the pandas cookbook, in the section on grouping: "Grouping like Python’s itertools.groupby"
