Jerome Li - 1 month ago
Python Question

How to get rows with the max value by using Python?

I have R code that uses data.table to merge rows with the same FirstName and LastName, keeping the maximum value of each specified column (e.g. Score1, Score2, Score3). The input and expected output are as follows:

Input:
FirstName LastName Score1 Score2 Score3
fn1 ln1 41 88 50
fn1 ln1 72 66 77
fn1 ln1 69 72 90
fn2 ln2 80 81 73
fn2 ln2 59 91 66
fn3 ln3 75 80 66

Output:
FirstName LastName Score1 Score2 Score3
fn1 ln1 72 88 90
fn2 ln2 80 91 73
fn3 ln3 75 80 66

Now I want to migrate the R program to Spark. How can I do this by using Python?


As suggested by durbachit, you'll want to use pandas.

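import pandas as pd

# Replace the placeholder string with the path to your own file
df = pd.read_csv("your file here")
# Take the column-wise maximum within each (FirstName, LastName) group
max_df = df.groupby(by=['FirstName', 'LastName']).max()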

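max_df will then hold your desired output, with FirstName and LastName as the index; call max_df.reset_index() if you want them back as ordinary columns. See the pandas groupby docs for details.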
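
To check this against the sample in the question, here is a minimal sketch that reads the rows from an in-memory string instead of a file. The io.StringIO wrapper and the whitespace separator are assumptions about how the data is laid out; adjust sep to match your real file. as_index=False is used so the name columns stay as regular columns.

```python
import io

import pandas as pd

# Sample rows copied from the question; assumed to be whitespace-separated
raw = """FirstName LastName Score1 Score2 Score3
fn1 ln1 41 88 50
fn1 ln1 72 66 77
fn1 ln1 69 72 90
fn2 ln2 80 81 73
fn2 ln2 59 91 66
fn3 ln3 75 80 66
"""

# Parse the text as if it were a file, splitting on runs of whitespace
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Column-wise maximum per (FirstName, LastName) group;
# as_index=False keeps the grouping keys as ordinary columns
max_df = df.groupby(["FirstName", "LastName"], as_index=False).max()

print(max_df)
```

This should give one row per name pair, matching the expected output table in the question. Note that this runs in pandas on a single machine, not on Spark.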