MariKo MariKo - 3 months ago 9
R Question

Subset first value within certain groups in R

I have a big data set ss.eyes. Its short version looks like this:

subjectNo blockNo TrialNo Word
1 1 1 never
1 1 1 lacerates
1 1 1 knowledge
1 1 1 accumulates
1 1 2 actress
1 1 2 dietician
1 1 2 pricey
1 1 2 destroys
1 2 1 tornado
1 2 1 fortunate
1 2 2 plumber
1 2 2 combination
1 2 2 renews
2 1 1 faucet
2 1 1 memorizes
2 1 1 regularly
2 1 2 always
2 1 2 sycophant
2 1 2 huge
2 2 1 attracts
2 2 1 pleases
2 2 1 cargo
2 2 1 patron
2 2 2 magnet
2 2 2 deer
2 2 2 sometimes


I want to subset the first values of ss.eyes$Word of each TrialNo per each BlockNo and each subjectNo, so it would look like this:

subjectNo blockNo TrialNo Word
1 1 1 never
1 1 2 actress
1 2 1 tornado
1 2 2 plumber
2 1 1 faucet
2 1 2 always
2 2 1 attracts
2 2 2 magnet


I wrote a short code that should do the work. However, it returns some missing values in TrialNo, thus, some data are missing:

test <- ss.eyes %>% group_by(subjectNo, blockNo, TrialNo) %>%
summarise(first=head(Word,1), count=n_distinct(Word))


Is there another (more efficient?..) way how to do that?

Answer

We can use slice after grouping by 'subjectNo', 'blockNo' and 'TrialNo' in dplyr.

library(dplyr)
ss.eyes %>%
     group_by(subjectNo, blockNo, TrialNo) %>% 
     slice(1)
#   subjectNo blockNo TrialNo     Word
#      <int>   <int>   <int>    <chr>
#1         1       1       1    never
#2         1       1       2  actress
#3         1       2       1  tornado
#4         1       2       2  plumber
#5         2       1       1   faucet
#6         2       1       2   always
#7         2       2       1 attracts
#8         2       2       2   magnet

Or with data.table

library(data.table)
setDT(ss.eyes)[, head(.SD, 1) , by = .(subjectNo, blockNo, TrialNo)]
#   subjectNo blockNo TrialNo     Word
#1:         1       1       1    never
#2:         1       1       2  actress
#3:         1       2       1  tornado
#4:         1       2       2  plumber
#5:         2       1       1   faucet
#6:         2       1       2   always
#7:         2       2       1 attracts
#8:         2       2       2   magnet

Or using ave from base R

ss.eyes[with(ss.eyes, ave(Word, subjectNo, blockNo, TrialNo, FUN= seq_along)<2),]

Or with duplicated

ss.eyes[!duplicated(ss.eyes[1:3]),]

Benchmarks

As the OP mentioned about efficiency in the post, we can compare the methods

set.seed(24)
df1 <- data.frame(subjectNo = sample(1:1e5, 1e6, replace=TRUE), 
                blockNo = sample(1:1e3, 1e6, replace=TRUE),
                TrialNo = sample(1:1e2, 1e6, replace=TRUE),
                Word = sample(LETTERS, 1e6, replace= TRUE),
                 stringsAsFactors=FALSE)
df2 <- df1[do.call(order, df1[1:3]),]

system.time(df1 %>%
          group_by(subjectNo, blockNo, TrialNo) %>% 
          slice(1))
#    user  system elapsed 
#   5.30    0.08    5.37 
system.time(setDT(df1)[, head(.SD, 1) , by = .(subjectNo, blockNo, TrialNo)])
#   user  system elapsed 
#    0.3     0.0     0.3 

If we order the dataset by the first three columns i.e. 'df2'

system.time(df2 %>%
           group_by(subjectNo, blockNo, TrialNo) %>% 
           slice(1))
#   user  system elapsed 
#  4.47    0.09    4.57 

system.time(setDT(df2)[, head(.SD, 1) , by = .(subjectNo, blockNo, TrialNo)])
#   user  system elapsed 
#   0.05    0.02    0.07 

system.time(df2[!duplicated(df2[1:3]),])
#   user  system elapsed 
#  0.03    0.00    0.03