user1 user1 - 1 month ago 8
R Question

Sorting data frame by column, adding index within group

This question describes the setting for my question pretty well.

Instead of a second value however, I have a factor called

algorithm
. My data frame looks like the following (note the possibility of multiplicity of values even within their group):

algorithm <- c("global", "distributed", "distributed", "none", "global", "global", "distributed", "none", "none")
v <- c(5, 2, 6, 7, 3, 1, 10, 2, 2)
df <- data.frame(algorithm, v)
df
algorithm v
1 global 5
2 distributed 2
3 distributed 6
4 none 7
5 global 3
6 global 1
7 distributed 10
8 none 2
9 none 2


I would like to sort the dataframe by
v
but get the ordering position for every entry with respect to its group (algorithm). This position should then be added to the original data frame (so I don't need to rearrange it) because I would like to plot the calculated position as x and the value as y using a ggplot (grouped by algorithm, e.g. every algorithm is one set of points).

So the result should look like this:

algorithm v groupIndex
1 global 5 3
2 distributed 2 1
3 distributed 6 2
4 none 7 3
5 global 3 2
6 global 1 1
7 distributed 10 3
8 none 2 1
9 none 2 2


So far I know I can order the data by algorithm first and then by value or the other way round. I guess in a second step I would have to calculate the index within each group? Is there an easy way to do that?

df[order(df$algorithm, df$v), ]
algorithm v
2 distributed 2
3 distributed 6
7 distributed 10
6 global 1
5 global 3
1 global 5
8 none 2
9 none 2
4 none 7


Edit: It is not guaranteed, that there is the same amount of entries for each group!

Answer

A double application of order in each group should cover it:

ave(df$v, df$algorithm, FUN=function(x) order(order(x)) )
#[1] 3 1 2 3 2 1 3 1 2

Which is also equivalent to:

ave(df$v, df$algorithm, FUN=function(x) rank(x,ties.method="first") )
#[1] 3 1 2 3 2 1 3 1 2

, which in turn means you can take advantage of frank from data.table if you are concerned about speed:

setDT(df)[, grpidx := frank(v,ties.method="first"), by=algorithm]
df
#     algorithm  v grpidx
#1:      global  5      3
#2: distributed  2      1
#3: distributed  6      2
#4:        none  7      3
#5:      global  3      2
#6:      global  1      1
#7: distributed 10      3
#8:        none  2      1
#9:        none  2      2
Comments