dmeu dmeu - 1 month ago 3
Python Question

Is there a query method or similar for pandas Series (pandas.Series.query())?

The

pandas.DataFrame.query()
method is of great usage for (pre/post)-filtering data when loading or plotting. It comes particularly handy for method chaining.

I find myself often wanting to apply the same logic to a
pandas.Series
, e.g. after having done a method such as
df.value_counts
which returns a
pandas.Series
.

Example



Lets assume there is a huge table with the columns
Player, Game, Points
and I want to plot a histogram of the players with more than 14 times 3 points. I first have to sum the points of each player (
groupby -> agg
) which will return a Series of ~1000 players and their overall points. Applying the
.query
logic it would look something like this:

df = pd.DataFrame({
'Points': [random.choice([1,3]) for x in range(100)],
'Player': [random.choice(["A","B","C"]) for x in range(100)]})

(df
.query("Points == 3")
.Player.values_count()
.query("> 14")
.hist())


The only solutions I find force me to do an unnecessary assignment and break the method chaining:

(points_series = df
.query("Points == 3")
.groupby("Player").size()
points_series[points_series > 100].hist()


Method chaining as well as the query method help to keep the code legible meanwhile the subsetting-filtering can get messy quite quickly.

# just to make my point :)
series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape


Please help me out of my dilemma! Thanks

Answer

IIUC you can add query("Points > 100"):

df = pd.DataFrame({'Points':[50,20,38,90,0, np.Inf],
                   'Player':['a','a','a','s','s','s']})

print (df)
  Player     Points
0      a  50.000000
1      a  20.000000
2      a  38.000000
3      s  90.000000
4      s   0.000000
5      s        inf

points_series = df.query("Points < inf").groupby("Player").agg({"Points": "sum"})['Points']
print (points_series)     
a = points_series[points_series > 100]
print (a)     
Player
a    108.0
Name: Points, dtype: float64


points_series = df.query("Points < inf")
                  .groupby("Player")
                  .agg({"Points": "sum"})
                  .query("Points > 100")

print (points_series)     
        Points
Player        
a        108.0

Another solution is Selection By Callable:

points_series = df.query("Points < inf")
                  .groupby("Player")
                  .agg({"Points": "sum"})['Points']
                  .loc[lambda x: x > 100]

print (points_series)     
Player
a    108.0
Name: Points, dtype: float64