Joel Underwood Joel Underwood - 1 year ago 67
Python Question

faster way to run countifs in python

I previously asked the question on how to do a countifs in python across multiple data frames, just like you can do countifs on separate worksheets in Excel. somebody gave me a very creative answer:

python pandas countifs using multiple criteria AND multiple data frames

Thank you for that @AlexG--I tried it, and it worked superbly:

import pandas as pd
import numpy as np
import matplotlib as plt

#import the data
students = pd.read_csv("Student Detail stump.csv")
exams = pd.read_csv("Exam Detail stump.csv")

#get data parameters
student_info = students[['Student Number', 'Enrollment Date', 'Detail Date']].values

#prepare an empty list to hold the results
N_exams_passed = []

#count records in data set according to parameters
for s_id, s_enroll, s_qual in student_info:
N_exams_passed.append(len(exams[(exams['Student Number']==s_id) &
(exams['Exam Grade Date']>=s_enroll) &
(exams['Exam Grade Date']<=s_qual) &
(exams['Exam Grade']>=70)])

#add the results to the original data set
students['Exams Passed'] = N_exams_passed

HOWEVER, it only worked effectively on small data sets. When I ran the data with 100,000s of rows, it wouldn't even be done overnight. It doesn't seem very pythonic.

The SQL way you can do this in seconds is to use a correlated subquery, like this:

(SELECT COUNT(e.[Exam Grade])
exams AS e
e.[Exam Grade] >= 65
AND e.[Student Number] = s.[Student Number]
AND e.[Exam Grade Date] >= s.[Enrollment Date]
AND e.[Exam Grade Date] <= s.[Detail Date])
AS ExamsPassed
students AS s;

How do I reproduce such a correlated subquery in pandas or some other pythonic way?

Here are the data frames:

Student Number Enroll Date Detail Date
1 1/1/2016 2/1/2016
1 1/1/2016 3/1/2016
2 2/1/2016 3/1/2016
3 3/1/2016 4/1/2016

Student Number Exam Date Exam Grade
1 1/1/2016 50
1 1/15/2016 80
1 1/28/2016 90
1 2/5/2016 100
1 3/5/2016 80
1 4/5/2016 40
2 2/2/2016 85
2 2/3/2016 10
2 2/4/2016 100

Final data frame should look like this, with a count of 'Passed Exams' at the end:

Student Number Enroll Date Detail Date Passed Exams
1 1/1/2016 2/1/2016 2
1 1/1/2016 3/1/2016 3
2 2/1/2016 3/1/2016 2
3 3/1/2016 4/1/2016 0

Answer Source

If I understand the structure of your dataframes correctly, I'd suggest merging the two dataframes and then performing the task on the merged data using numpy.where.

import numpy as np

exams = exams.merge(students, on='Student Number', how='left')
exams['Passed'] = np.where(
    (exams['Exam Grade Date'] >= exams['Enrollment Date']) &
    (exams['Exam Grade Date'] <= exams['Detail Date']) &
    (exams['Grade'] >= 70),
    1, 0)

students = students.merge(
    exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum().reset_index(),
    left_on=['Student Number', 'Detail Date'],
    right_on=['Student Number', 'Detail Date'],
students['Passed'] = students['Passed'].fillna(0).astype('int')

Note: you'll need to make sure the date columns are properly stored as datetimes (you can use pandas.to_datetime to do this).

numpy.where creates a new array where the values are one way (1 in the example above) if the conditions you specify are met and another (0) if they aren't met.

The line exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum() produces a series in which the index is Student Number and Detail Date and the values are the counts of passed exams corresponding to that Student Number and Detail Date combination. The reset_index() makes it into a dataframe for merging.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download