Anthony Damico - 1 year ago 63

R Question

my apologies if this is a duplicate. this seems like a question that SO would've answered long ago, but i did quite a bit of searching and couldn't find anything specifically answering this. there are lots of related questions that could be used to answer this, but i figured it should be answered formally.

this is in response to this question asked on the r-help mailing list.

here are *lots* of examples of how to do this using

`sql`

`sqldf`

main question: using the example

`mtcars`

if you open up R and type

`mtcars`

`cyl`

`cyl`

`mpg cyl disp hp drat wt qsec vs am gear carb ranks`

Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 2.0

Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 1.0

Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 2.0

Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 3.0

Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1.0

Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 1.5

Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 1.5

Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 3.0

Answer Source

```
# start with the mtcars data frame (included with your installation of R)
mtcars
# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )
# choose whether you want to find the minimum or maximum
find.maximum <- FALSE
# create a simple data frame with only two columns
x <- mtcars
# order it based on
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]
# figure out the ranks of each miles-per-gallon, within cyl columns
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# done!
# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]
# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties. using `max` would *exclude* all ties
if ( find.maximum ){
# note the negative sign (which changes the order of mpg)
# *and* the `rev` function, which flips the order of the `tapply` result
x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.
```