Jeff Saltfist Jeff Saltfist - 5 months ago 16
Python Question

Pandas Column Manipulation

I am trying to manipulate a pandas df so I can count the frequency of occurrences of a data point in one column after a specific occurrence another column. My psuedo code below probably sums it up best. Any help would be much appreciated!

import datetime
import time
import pandas as pd

# Set number of rows to skip
rows_to_skip = 0
# Rows to use after skipped rows
rows_to_use = 10000

# Read the file (Adjust arguments accordingly)
data = pd.read_csv('example.csv',skiprows=rows_to_skip, error_bad_lines=False, nrows=rows_to_use, low_memory=False)

# Add headers when skipping rows
data.columns = ["X","Y","Z"]

# Psuedo Code Below

for variable in data['X']:
if variable > 0:
# Count number of times the following conditions are met in all subsequent rows:
condition 1 ) Y > 0
condition 2 ) Z <= Z of the row where variable was > 0

# Then I want to add the total count to a new column, and have it in the same row as X when the "variable" > 0.


Any help?

Answer

Assuming you want the count for the cases between every instance where X>0, as opposed to the count for the remainder of the entire DataFrame after each X>0:

You could create a new column that indicates where the X>0 condition holds True, .fillna(method='ffill') and .groupby() on the result. Then you only need to .apply() to get the len() of the group where the other conditions are True.

Some sample data:

df = pd.DataFrame(data=np.random.randint(-10, 10, size=(100, 3)), columns=list('XYZ'))

   X   Y  Z
0 -3   6 -7
1 -4 -10 -1
2  9 -10 -9
3  5   0 -8
4 -2   1 -8

Proceed as follows:

df['condition'] = df.index.to_series().where(df.X > 0).fillna(method='ffill')
df['count'] = df.groupby('condition').apply(lambda x: len(x[(x.Y>0) & (x.Z > x.Z.iloc[0])]))

to get:

    X   Y  Z  condition  count
0  -3   6 -7        NaN    NaN
1  -4 -10 -1        NaN    NaN
2   9 -10 -9        2.0    0.0
3   5   0 -8        3.0    0.0
4  -2   1 -8        3.0    NaN
5   6  -6 -3        5.0    1.0
6   0   6  3        5.0    NaN
7  -6  -7 -6        5.0    NaN
8   7  -2 -5        8.0    0.0
9   0  -1  5        8.0    NaN
10  5   8 -3       10.0    0.0
11 -2  -2  1       10.0    NaN
12  3   4  2       12.0    1.0
13 -5   1 -9       12.0    NaN
14 -7   2  6       12.0    NaN
15  1 -10  6       15.0    0.0
16  1  -8  6       16.0    0.0
17 -4  -9 -8       16.0    NaN
18 -9   4  6       16.0    NaN
19  5  -6  2       19.0    0.0
20  5   7 -1       20.0    0.0
21  2  -2 -3       21.0    0.0
22 -6 -10 -2       21.0    NaN
23 -7  -9  3       21.0    NaN
24 -8   7 -8       21.0    NaN
25  3  -3  6       25.0    0.0
26  1  -6 -3       26.0    1.0
27 -4   6 -1       26.0    NaN
28  6  -4  9       28.0    0.0
29 -8   2  1       28.0    NaN
.. ..  .. ..        ...    ...
70 -5   7 -6       68.0    NaN
71  6   6 -7       71.0    1.0
72 -3   0  3       71.0    NaN
73 -5   3  2       71.0    NaN
74 -6  -8  8       71.0    NaN
75  1   0 -4       75.0    0.0
76  7  -9 -5       76.0    0.0
77  1   0 -1       77.0    0.0
78  5   9 -2       78.0    0.0
79 -8  -9 -6       78.0    NaN
80  2  -3  3       80.0    3.0
81 -7  -5  8       80.0    NaN
82 -4  -5 -7       80.0    NaN
83 -3   5 -6       80.0    NaN
84 -5   1  4       80.0    NaN
85 -1   6  7       80.0    NaN
86 -7   4  4       80.0    NaN
87 -7  -4 -1       80.0    NaN
88 -2  -8  2       80.0    NaN
89  4   6  4       89.0    0.0
90  4 -10 -8       90.0    0.0
91 -7  -9  5       90.0    NaN
92  5   3 -1       92.0    0.0
93  6   6  6       93.0    0.0
94  9  -2  0       94.0    1.0
95 -1   5  5       94.0    NaN
96  2   8 -9       96.0    2.0
97 -6   7 -4       96.0    NaN
98 -1   7 -8       96.0    NaN
99 -4   0 -1       96.0    NaN