Dan Dan - 23 days ago 7
Python Question

Nested queries / comparing multiple datasets efficiently in Pandas

I am using Pandas (first time) to determine whether personnel meet prerequisites when it comes to course attendance. The code below returns the desired results however I am sure there are much better ways to achieve the same outcome.

The criteria to determine if you can undertake physics is as follows;

(Math_A OR Math_B OR Math_C) AND (Eng_A OR Eng_B) AND NOT (Physics)


My question is what efficiencies or alternate methods can be applied to achieve the task.

Reading up on nested queries and alike, I am not able to come up with a way to compare multiple queries in the one query. Ideally, I'd like to have one statement that checks if the person satisfies the prerequisites however I have failed at this so far.

Dataset - Will normally contain > 20,000 records

Emplid,Name,CourseId,CourseName
123,David,P12,Mathematics A
123,David,P13,Mathematics B
123,David,P14,Mathematics C
123,David,P32,Engineering A
456,Sally,P33,Engineering B
789,Roger,P99,Physics


Code

Revised to simplify readability - Thx Boud.

import pandas as pd

def physics_meets_prereqs():

df = pd.DataFrame({'Emplid':['123','123', '123', '123', '456', '789'],
'Name':['David','David','David','David','Sally','Roger'],
'CourseId':['P12','P13','P14','P32','P33','P99'],
'CourseName':['Mathematics A','Mathematics B','Mathematics C','Engineering A','Engineering B', 'Physics']
})

# Get datasets of individually completed courses
has_math = df.query('CourseId == "P12" or CourseId == "P13" or CourseId == "P14"')
has_eng = df.query('CourseId == "P32" or CourseId == "P33"')
has_physics = df.query('CourseId == "P99"')

# Get personnel who have completed math and engineering
has_math_and_eng = has_math[(has_math['Emplid'].isin(has_eng['Emplid']))]

# Remove personnel who have completed physics
has_math_and_eng_no_physics = has_math_and_eng[~(has_math_and_eng['Emplid'].isin(has_physics['Emplid']))]

print(has_math_and_eng_no_physics)

physics_meets_prereqs()


Output

CourseId CourseName Emplid Name
0 P12 Mathematics A 123 David
1 P13 Mathematics B 123 David
2 P14 Mathematics C 123 David


The output is resulting in David being identified as meeting the prerequisites for the physics course. It does list him 3 times which I have not figured out how to limit as yet. The way I am achieving this though can definitely be improved.

In a nutshell

Show me a list of people who have completed at least one of the maths courses, at least one of the engineering courses and have not yet completed the physics course.

Answer
pivoted = df.groupby(['Name', df.CourseName.str.split().str[0]]) \
            .CourseId.size().gt(0).unstack(fill_value=False)

pivoted

enter image description here

matches = pivoted.query('Engineering & Mathematics & ~Physics')
matches

enter image description here

df.query('Name in @matches.index')

enter image description here

Comments