Dan Dan - 11 months ago 58
Python Question

Nested queries / comparing multiple datasets efficiently in Pandas

I am using Pandas (first time) to determine whether personnel meet prerequisites when it comes to course attendance. The code below returns the desired results however I am sure there are much better ways to achieve the same outcome.

The criteria to determine if you can undertake physics is as follows;

(Math_A OR Math_B OR Math_C) AND (Eng_A OR Eng_B) AND NOT (Physics)


My question is what efficiencies or alternate methods can be applied to achieve the task.

Reading up on nested queries and alike, I am not able to come up with a way to compare multiple queries in the one query. Ideally, I'd like to have one statement that checks if the person satisfies the prerequisites however I have failed at this so far.

Dataset - Will normally contain > 20,000 records

Emplid,Name,CourseId,CourseName
123,David,P12,Mathematics A
123,David,P13,Mathematics B
123,David,P14,Mathematics C
123,David,P32,Engineering A
456,Sally,P33,Engineering B
789,Roger,P99,Physics


Code

Revised to simplify readability - Thx Boud.

import pandas as pd

def physics_meets_prereqs():

df = pd.DataFrame({'Emplid':['123','123', '123', '123', '456', '789'],
'Name':['David','David','David','David','Sally','Roger'],
'CourseId':['P12','P13','P14','P32','P33','P99'],
'CourseName':['Mathematics A','Mathematics B','Mathematics C','Engineering A','Engineering B', 'Physics']
})

# Get datasets of individually completed courses
has_math = df.query('CourseId == "P12" or CourseId == "P13" or CourseId == "P14"')
has_eng = df.query('CourseId == "P32" or CourseId == "P33"')
has_physics = df.query('CourseId == "P99"')

# Get personnel who have completed math and engineering
has_math_and_eng = has_math[(has_math['Emplid'].isin(has_eng['Emplid']))]

# Remove personnel who have completed physics
has_math_and_eng_no_physics = has_math_and_eng[~(has_math_and_eng['Emplid'].isin(has_physics['Emplid']))]

print(has_math_and_eng_no_physics)

physics_meets_prereqs()


Output

CourseId CourseName Emplid Name
0 P12 Mathematics A 123 David
1 P13 Mathematics B 123 David
2 P14 Mathematics C 123 David


The output is resulting in David being identified as meeting the prerequisites for the physics course. It does list him 3 times which I have not figured out how to limit as yet. The way I am achieving this though can definitely be improved.

In a nutshell

Show me a list of people who have completed at least one of the maths courses, at least one of the engineering courses and have not yet completed the physics course.

Answer Source
pivoted = df.groupby(['Name', df.CourseName.str.split().str[0]]) \
            .CourseId.size().gt(0).unstack(fill_value=False)

pivoted

enter image description here

matches = pivoted.query('Engineering & Mathematics & ~Physics')
matches

enter image description here

df.query('Name in @matches.index')

enter image description here