Ruben - 1 year ago 59
R Question

# How can I rank observations in-group faster?

I have a really simple problem, but I'm probably not thinking vector-y enough to solve it efficiently. I tried two different approaches and they've been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but ... bleh.

## rank observations in group

I have long data (many rows per person, one row per person-observation) and I basically want a variable, that tells me how often the person has been observed already.

I have the first two columns and want the third one:

``````person  wave   obs
pers1   1999   1
pers1   2000   2
pers1   2003   3
pers2   1998   1
pers2   2001   2
``````

Now I'm using two loop-approaches. Both are excruciatingly slow (150k rows). I'm sure I'm missing something, but my search queries didn't really help me yet (hard to phrase the problem).

Thanks for any pointers!

``````# ordered dataset by persnr and year of observation
person.obs <- person.obs[order(person.obs\$PERSNR,person.obs\$wave) , ]

person.obs\$n.obs = 0

# first approach: loop through people and assign range
unp = unique(person.obs\$PERSNR)
unplength = length(unp)
for(i in 1:unplength) {
print(unp[i])
person.obs[which(person.obs\$PERSNR==unp[i]),]\$n.obs =
1:length(person.obs[which(person.obs\$PERSNR==unp[i]),]\$n.obs)
i=i+1
gc()
}

# second approach: loop through rows and reset counter at new person
pnr = 0
for(i in 1:length(person.obs[,2])) {
if(pnr!=person.obs[i,]\$PERSNR) { pnr = person.obs[i,]\$PERSNR
e = 0
}
e=e+1
person.obs[i,]\$n.obs = e
i=i+1
gc()
}
``````

A few alternatives with the data.table and dplyr packages.

data.table:

``````library(data.table)
setDT(foo)[, rn := 1:.N, by = person]  # setDT(foo) is needed to convert to a data.table
``````

Or with the new `rowid` function (v1.9.7+, currently thus only available in the development version)

``````setDT(foo)[, rn := rowid(person)]
``````

both give:

``````> foo
person year rn
1:  pers1 1999  1
2:  pers1 2000  2
3:  pers1 2003  3
4:  pers2 1998  1
5:  pers2 2011  2
``````

If you want a true rank, you should use the `frank` function:

``````setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]
``````

dplyr:

``````library(dplyr)
# method 1
foo <- foo %>% group_by(person) %>% mutate(rn = row_number())
# method 2
foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())
``````

both giving a similar result:

``````> foo
Source: local data frame [5 x 3]
Groups: person [2]

person  year    rn
(fctr) (dbl) (int)
1  pers1  1999     1
2  pers1  2000     2
3  pers1  2003     3
4  pers2  1998     1
5  pers2  2011     2
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download