eleanora eleanora - 4 months ago 11
Python Question

Selecting using random pairs of data in pandas

I have a very large file with three columns. The first two are integers and the third is a string. I read it in using pandas using

data = pd.read_csv("edges+stuff.txt", sep=' ', header=None, dtype={0:np.uint32, 1:np.uint32, 2:np.str})


Here is some example fake data:

2 0 Somestuff9
2 0 Somestuff0
1 1 Somestuff5
0 0 Somestuff7
2 0 Somestuff9
2 0 Somestuff5
2 1 Somestuff2
1 1 Somestuff8
1 1 Somestuff2
1 0 Somestuff4
2 1 Somestuff3
0 2 Somestuff9
1 1 Somestuff10
1 0 Somestuff9


I would like to perform the following random sampling which I am stuck on. I want to pick a number of random pairs that exist from the data frame. I don't want to pick a random row as, for example "1 1" occurs four times but I would like to have an equal chance of picking any pair that exists in the data frame. If I did pick "1 1" I would then like to output all the rows that start "1 1".

Using my example fake data, I would like to select some pairs randomly from [(0,0), (1,0), (1,1), (0,2), (2,0), (2,1)] (these are all the pairs that exist in the data) and then use those pairs to select rows from the dataframe.

One way to do this would be to take the first two columns, sort them and perform the equivalent of
np.unique
. Then select random pairs from this uniqued list and then use them to select from the original dataframe.


Is there some way to do this efficiently in pandas?

Answer

Here's one way:

df.head()
Out: 
   col1  col2        col3
0     2     0  Somestuff9
1     2     0  Somestuff0
2     1     1  Somestuff5
3     0     0  Somestuff7
4     2     0  Somestuff9

Select one pair randomly:

df[['col1', 'col2']].drop_duplicates().sample(n=1)
Out: 
   col1  col2
0     2     1

(Here drop duplicates drops all rows that have the same col1 col2 pairs except for the first one and .sample(n=1) select one among them.)

All rows that have col1=2, col2=1:

df[['col1', 'col2']].drop_duplicates().sample(n=1).merge(df)
Out: 
   col1  col2        col3
0     2     1  Somestuff2
1     2     1  Somestuff3