Ray Ray - 2 months ago 14
Python Question

Python Pandas select rows based on membership of another collection (set)

Suppose I have a

DataFrame
constructed as follows:

import pandas
import numpy

column_names = ["name", "age", "score"]
names = numpy.random.choice(["Jorge", "Xavier", "Joaquin", "Juan", "Jose"], 50)
ages = numpy.random.randint(0, 100, 50)
scores = numpy.random.rand(50)
df = pandas.DataFrame.from_dict(dict(zip(column_names, [names, ages, scores])))


The top 10 rows of the above
DataFrame
looks like the following.

age name score
0 15 Jorge 0.031380
1 44 Juan 0.373199
2 84 Xavier 0.999065
3 55 Juan 0.159873
4 55 Joaquin 0.211931
5 33 Juan 0.484350
6 22 Xavier 0.510276
7 86 Joaquin 0.490013
8 2 Jose 0.185086
9 51 Juan 0.979015


I want to be able to select rows for which the elements of the
name
column is a member of
{"Xavier", "Joaquin"}
. Instinctively I'm thinking of something like
df.iloc[df["name"] in {"Xavier", "Joaquin"}, :]
but that doesn't work. So how do I achieve it?

Note



I know I can achieve this particular example by

df.loc[numpy.logical_or(df["name"] == "Xavier", df["name"] == "Joaquin"), :]


but that's not the point. This is just a simplified example of my real problem. I have a
DataFrame
of height 2,340,923 and a name set
names
of size 3,624 and I want to select the rows whose names are members of the name set
names
.

Answer

I think you need isin:

print (df.loc[df["name"].isin(["Xavier", "Joaquin"]), :])
    age     name     score
1    66  Joaquin  0.767056
2    17  Joaquin  0.721369
7    53  Joaquin  0.209415
10    9   Xavier  0.394815
13   20  Joaquin  0.276596
14   17   Xavier  0.810725
15   76   Xavier  0.918273
17   91  Joaquin  0.974723
18   39   Xavier  0.869607
21    3   Xavier  0.200578
22   34  Joaquin  0.938018
23   90   Xavier  0.664387
26   51   Xavier  0.946753
28   49   Xavier  0.859911
30   22  Joaquin  0.602381
34    7   Xavier  0.759837
35   96  Joaquin  0.790691
39   13  Joaquin  0.599557
40   10   Xavier  0.563933
41   69   Xavier  0.983787
43   58   Xavier  0.542903
44    8  Joaquin  0.307106
45   77  Joaquin  0.330278
46   55  Joaquin  0.980077
47   12   Xavier  0.177509
49   15  Joaquin  0.590958

It works with set nice also:

names = set(["Xavier", "Joaquin"])
print (df.loc[df["name"].isin(names), :])

    age     name     score
1    66  Joaquin  0.767056
2    17  Joaquin  0.721369
7    53  Joaquin  0.209415
10    9   Xavier  0.394815
13   20  Joaquin  0.276596
14   17   Xavier  0.810725
15   76   Xavier  0.918273
17   91  Joaquin  0.974723
18   39   Xavier  0.869607
21    3   Xavier  0.200578
22   34  Joaquin  0.938018
23   90   Xavier  0.664387
26   51   Xavier  0.946753
28   49   Xavier  0.859911
30   22  Joaquin  0.602381
34    7   Xavier  0.759837
35   96  Joaquin  0.790691
39   13  Joaquin  0.599557
40   10   Xavier  0.563933
41   69   Xavier  0.983787
43   58   Xavier  0.542903
44    8  Joaquin  0.307106
45   77  Joaquin  0.330278
46   55  Joaquin  0.980077
47   12   Xavier  0.177509
49   15  Joaquin  0.590958