user3667569 user3667569 - 1 month ago 4
Python Question

How to count how many points are "better" than other points in pandas dataframe?

I have a dataframe in pandas which look something like this:

>>> df[1:3]
0 1 2 3 4 5 6 7 8
1 -0.59 -99.0 924.0 20.1 5.0 4.0 57.0 19.0 8.0
2 -1.30 -279.0 297.0 16.1 30.0 4.4 63.0 19.0 10.0


The number of points in the dataframe is ~1000.
Given a set of columns, I want to find out how many time each point is "better" than the other?
Given a set of
n
columns, a point is better than other point, if it better in at least one of the columns and equal in other columns.
A point which is better in one column and worse in
n-1
is not considered better because its better than the other point in at least one column.

Edit1: Example:

>>> df
0 1 2
1 -0.59 -99.0 924.0
2 -1.30 -279.0 297.0
3 2.00 -100.0 500.0
4 0.0 0.0 0.0


If we consider only column 0, then the result would be:

1 - 1
2 - 0
3 - 3
4 - 2


because point 1 (-0.59) is only better than point 2 with respect to column 1.

Another example by taking columns - 0 and 1:

1 - 1 (only for point 2 all values i.e. column 0 and column 1 are either smaller than point 1 or lesser)
2 - 0 (since no point is has any lesser than this in any dimension)
3 - 1 (point 2)
4 - 2 (point 1 and 2)


Edit 2:
Perhaps, something like a function which when given a dataframe, a point (index of point) and a set of columns could give the count as - for each subset of columns how many times that point is better than other points.

def f(p, df, c):
"""returns
A list : L = [(c1,n), (c2,m)..]
where c1 is a proper subset of c and n is the number of times that this point was better than other points."""

Answer

rank each column separately
by ranking each column, I can see exactly how many other rows in that column the particular row you're in is greater than.

d1 = df.rank().sub(1)
d1

enter image description here

to solve your problem, it logically has to be the case that for a particular row, the smallest rank among the row elements is precisely the number of other rows in which every element in this row is greater than.

for the first two columns [0, 1], it can be calculated by by taking the min of d1

I use this for reference to compare the raw first two columns with the ranks

pd.concat([df.iloc[:, :2], d1.iloc[:, :2]], axis=1, keys=['raw', 'ranked'])

enter image description here

Take the min as stated above.

d1.iloc[:, :2].min(1)

1    1.0
2    0.0
3    1.0
4    2.0
dtype: float64

put the result next to raw data and ranks so we can see it

pd.concat([df.iloc[:, :2], d1.iloc[:, :2], d1.iloc[:, :2].min(1)],
          axis=1, keys=['raw', 'ranked', 'results'])

enter image description here

sure enough, that ties out with your expected results.

Comments