iraserd - 1 year ago 88
R Question

# R - Keep first observation per group identified by multiple variables (Stata equivalent "bys var1 var2 : keep if _n == 1")

So I currently face a problem in R that I exactly know how to deal with in Stata, but have wasted over two hours to accomplish in R.

Using the data.frame below, the result I want is to obtain exactly the first observation per group, while groups are formed by multiple variables and have to be sorted by another variable, i.e. the data.frame mydata obtained by:

``````id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)
``````

Should be transformed to:

``````   id day value
1   1    10
1   2    15
1   3    20
2   1    40
2   2    30
3   2    22
3   3    24
4   1    11
4   2    11
4   3    12
``````

By keeping only one of the rows with one or multiple duplicate group-identificators (here that is only
`row[1]: (id,day)=(1,1))`
, sorting for value first (so that the row with the lowest value is kept).

In Stata, this would simply be:

``````bys id day (value): keep if _n == 1
``````

I found a piece of code on the web, which properly does that if I first produce a single group identifier :

``````mydata\$id1 <- paste(mydata\$id,"000",mydata\$day, sep="")  ### the single group identifier

myid.uni <- unique(mydata\$id1)
a<-length(myid.uni)

last <- c()

for (i in 1:a) {
temp<-subset(mydata, id1==myid.uni[i])
if (dim(temp)[1] > 1) {
last.temp<-temp[dim(temp)[1],]
}
else {
last.temp<-temp
}
last<-rbind(last, last.temp)
}

last
``````

However, there are a few problems with this approach:

1. A single identifier needs to be created (which is quickly done).

2. It seems like a cumbersome piece of code compared to the single line of code in Stata.

3. On a medium-sized dataset (below 100,000 observations grouped in lots of about 6), this approach would take about 1.5 hours.

Is there any efficient equivalent to Stata's
`bys var1 var2: keep if _n == 1`
?

I would order the `data.frame` at which point you can look into using `by`:

``````mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]

do.call(rbind, by(mydata, list(mydata\$id, mydata\$day),
``````

Alternatively, look into the "data.table" package. Continuing with the ordered `data.frame` from above:

``````library(data.table)

DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
#     id day value
#  1:  1   1    10
#  2:  1   2    15
#  3:  1   3    20
#  4:  2   1    40
#  5:  2   2    30
#  6:  3   2    22
#  7:  3   3    24
#  8:  4   1    11
#  9:  4   2    11
# 10:  4   3    12
``````

Or, starting from scratch, you can use `data.table` in the following way:

``````DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]
``````

And, by extension, in base R:

``````Ranks <- with(mydata, ave(value, id, day, FUN = function(x)
rank(x, ties.method="first")))
mydata[Ranks == 1, ]
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download