Joe - 3 months ago 8

# Conditional length of a binary data series in Pandas

I have a DataFrame with the following column:

``````
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
``````

What would be the best vectorized way to cap the length of each run of consecutive 1s at some limit? Say the limit is 2; then the resulting column 'B' must look like:

``````
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
``````

One fully vectorized solution is to use the `shift`-`groupby`-`cumsum`-`cumcount` combination¹ to flag the rows that fall within the first 2 positions of each run of consecutive equal values (or whatever limit you like). Then `&` this boolean Series with the original column, so that only the 1s survive:

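``````
# Label each run of consecutive equal values: a new run starts where A changes
run_id = (df.A != df.A.shift()).cumsum()
# Keep a 1 only if it is among the first 2 rows of its run; cast the booleans back to integers
df['B'] = ((df.groupby(run_id).cumcount() <= 1) & df.A.astype(bool)).astype(int)
``````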

This produces the new column in the DataFrame:

``````
   A  B
0  1  1
1  1  1
2  1  0
3  0  0
4  1  1
5  1  1
6  1  0
7  1  0
8  0  0
9  1  1
``````
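To reuse this with other limits, the same steps can be wrapped in a small helper. This is only a sketch: the function name `cap_run_length` and its `limit` parameter are made up for illustration, and it assumes the column holds only 0s and 1s.

```python
import pandas as pd

def cap_run_length(s: pd.Series, limit: int) -> pd.Series:
    """Zero out every 1 beyond the first `limit` in each run of consecutive 1s."""
    # A new run starts wherever the value differs from the previous row.
    run_id = (s != s.shift()).cumsum()
    # Position of each row inside its run: 0, 1, 2, ...
    pos_in_run = s.groupby(run_id).cumcount()
    # Keep a 1 only if it sits within the first `limit` positions of its run.
    return ((pos_in_run < limit) & s.astype(bool)).astype(int)

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = cap_run_length(df['A'], limit=2)
print(df['B'].tolist())  # [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```

Note that `pos_in_run < limit` is the same test as `cumcount() <= 1` when the limit is 2. Runs of 0s get numbered too, but the final `&` with the original column discards them.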

¹ See the pandas cookbook; the section on grouping, "Grouping like Python's itertools.groupby"
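For comparison, here is roughly what the same rule looks like with plain `itertools.groupby`. It is not vectorized, so treat it as a reference for checking results on small inputs and not as a replacement. The function name `cap_run_length_py` is hypothetical.

```python
from itertools import groupby

def cap_run_length_py(values, limit):
    """Pure-Python reference: keep at most `limit` leading 1s of every run."""
    out = []
    # groupby yields one (value, iterator) pair per run of consecutive equal values.
    for value, run in groupby(values):
        n = len(list(run))
        if value == 1:
            kept = min(n, limit)
            out.extend([1] * kept + [0] * (n - kept))
        else:
            out.extend([0] * n)
    return out

print(cap_run_length_py([1, 1, 1, 0, 1, 1, 1, 1, 0, 1], 2))
# [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```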