Elizabeth Crutchley Elizabeth Crutchley - 1 month ago 8
R Question

Extract rows for the first occurrence of a variable in a group

I have a huge dataset (more than 2 million rows of over 100 variables; below is a small sample). For each

subj_trial
group, I want to find the first occurrence of each unique variable containing in ".wav" in
message
. It should be just containing, not ending (i.e. *.wav), because some rows have a bunch of information in the
message
fields (not pictured in the example, sorry).

It would be OK to output a data.frame that only had those three columns, but it's not necessary. I will later need to use the
timestamp
column for analyses.

I've found this question: Extract rows for the first occurrence of a variable in a data frame, but for the life of me I cannot work that example to fit mine.

Here's some sample data:

subj_trial message timestamp
1 1_1 message 459 755616
2 1_1 . 755618
3 1_1 test1.wav 755662
4 1_1 . 765712
5 1_1 test1.wav 767918
6 1_2 . 769342
7 1_2 test2.wav 775662
8 1_2 . 786412
9 1_2 test2.wav 797460
10 1_2 . 807626
11 1_3 test3.wav 817794
12 1_3 warning 11 827960
13 2_1 message 481 817313
14 2_1 test1.wav 817347
15 2_1 . 834959
16 2_1 test1.wav 855007
17 2_1 . 880107
18 2_2 . 895723
19 2_2 test2.wav 922671
20 2_2 . 958003
21 2_2 test2.wav 994385
22 2_3 . 1016217
23 2_3 test3.wav 1036899
24 2_3 . 1047331
25 2_3 test3.wav 1142527


This is a very small example of what I'm dealing with, here. For each
subj_trial
group there are probably 3000 lines, and there are over 700 groups.

Here's an example of what I'd like to have.

subj_trial message timestamp
1 1_1 test1.wav 755662
2 1_2 test2.wav 775662
3 1_3 test3.wav 817794
4 2_1 test1.wav 817347
5 2_2 test2.wav 922671
6 2_3 test3.wav 1036899


I've figured out how to get the unique values in
message
over the entire dataset by doing this:

unique_message <- df[match(unique(df$message), df$message),]


But I can't figure out how to do it by group. I've also tried using
group_by
in the
dplyr
package but can't get that to work, either. Have mercy and show me the way, friends. Thanks!

Answer

Using data.table:

library(data.table)
setDT(DT)
DT[,{
  id=head(grep(".wav",message),1)
  list(message=message[id],timestamp=timestamp[id])
},subj_trial]

#    subj_trial   message timestamp
# 1:        1_1 test1.wav    755662
# 2:        1_2 test2.wav    775662
# 3:        1_3 test3.wav    817794
# 4:        2_1 test1.wav    817347
# 5:        2_2 test2.wav    922671
# 6:        2_3 test3.wav   1036899