AeonRed - 2 months ago
R Question

R For loop fails applying max function

I should say up front that I'm new to R and still learning the fundamentals.
Currently I'm working on a large data frame (called "ppl") from which I need to filter some rows. Each row belongs to a group (grp) and has an intensity (into) value and a sample value.

mz rt into sample tracker sn grp
100.0153 126 2.762664 3 11908 7.522655 0
100.0171 127 2.972048 2 5308 7.718521 0
100.0788 272 30.217969 2 5309 19.024807 1
100.0796 272 17.277916 3 11910 7.297716 1
101.0042 128 37.557324 3 11916 27.991320 2
101.0043 128 39.676014 2 5316 28.234918 2


Well, the first question is: "How can I select from each group the sample with the highest intensity?"
I tried a for loop:

for (i in ppl$grp) {
temp<-ppl[ppl$grp == i,]
sel<-rbind(sel,temp[max(temp$into),])
}


It works for ppl$grp == 0, but the later iterations return rows of NAs.
The filtered data frame (called "sel") should also store the sample values of the removed rows. It should look like this:

mz rt into sample tracker sn grp
100.0171 127 2.972048 c(2,3) 5308 7.718521 0
100.0788 272 30.217969 c(2,3) 5309 19.024807 1
101.0043 128 39.676014 c(2,3) 5316 28.234918 2


In order to get this I would use this approach:

lev<-factor(ppl$grp)
samp<-ppl$sample
samp2<-split(samp,lev)
sel$sample<-samp2


Any hints? I can't test this yet because I haven't solved the previous problem.

Thanks a lot.

Answer

A base R option using ave is

ppl[with(ppl, ave(into, grp, FUN = max)==into),]
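
To see why this works, here is a minimal sketch that rebuilds the six sample rows from the question (the names grp_max and sel are only for illustration). ave() returns, for every row, the maximum of 'into' within that row's 'grp', so comparing it with 'into' flags the top row of each group. Note that if two rows tie for the maximum in a group, both are kept.

```r
# Sample data copied from the question
ppl <- data.frame(
  mz      = c(100.0153, 100.0171, 100.0788, 100.0796, 101.0042, 101.0043),
  rt      = c(126L, 127L, 272L, 272L, 128L, 128L),
  into    = c(2.762664, 2.972048, 30.217969, 17.277916, 37.557324, 39.676014),
  sample  = c(3L, 2L, 2L, 3L, 3L, 2L),
  tracker = c(11908L, 5308L, 5309L, 11910L, 11916L, 5316L),
  sn      = c(7.522655, 7.718521, 19.024807, 7.297716, 27.991320, 28.234918),
  grp     = c(0L, 0L, 1L, 1L, 2L, 2L)
)

# ave() returns a vector as long as 'into', holding each row's group maximum
grp_max <- with(ppl, ave(into, grp, FUN = max))

# Keep the rows whose intensity equals the maximum of their group
sel <- ppl[ppl$into == grp_max, ]
sel
```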

If the 'sample' column in the expected output should hold the unique 'sample' values of each 'grp', then group by 'grp', replace 'sample' with the pasted unique values, arrange by 'into' in descending order, and slice the first row of each group.

library(dplyr)
ppl %>%
    group_by(grp) %>% 
    mutate(sample = toString(sort(unique(sample)))) %>% 
    arrange(desc(into)) %>%
    slice(1L)
#       mz    rt      into sample tracker        sn   grp
#     <dbl> <int>     <dbl>  <chr>   <int>     <dbl> <int>
#1 100.0171   127  2.972048   2, 3    5308  7.718521     0
#2 100.0788   272 30.217969   2, 3    5309 19.024807     1
#3 101.0043   128 39.676014   2, 3    5316 28.234918     2
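
As for why the loop in the question returns NA rows: max(temp$into) returns the largest intensity itself, not its position, and that number is then used as a row index. For grp 0 the value 2.972048 is truncated to 2, which happens to be the right row; for grp 1 it asks for row 30 of a two-row data frame, hence the NAs. The loop also runs over ppl$grp, so every group is visited once per row rather than once. A possible repair, assuming the sample data from the question and assuming 'sel' starts as an empty copy of 'ppl', is to use which.max() and loop over unique(ppl$grp). The last two statements are one way to get a list column like the one the question describes (the name 'samples' is only for illustration); treat it as a sketch rather than something tested on the full data.

```r
# Sample data copied from the question
ppl <- data.frame(
  mz      = c(100.0153, 100.0171, 100.0788, 100.0796, 101.0042, 101.0043),
  rt      = c(126L, 127L, 272L, 272L, 128L, 128L),
  into    = c(2.762664, 2.972048, 30.217969, 17.277916, 37.557324, 39.676014),
  sample  = c(3L, 2L, 2L, 3L, 3L, 2L),
  tracker = c(11908L, 5308L, 5309L, 11910L, 11916L, 5316L),
  sn      = c(7.522655, 7.718521, 19.024807, 7.297716, 27.991320, 28.234918),
  grp     = c(0L, 0L, 1L, 1L, 2L, 2L)
)

# Start from an empty data frame with the same columns as ppl (assumption)
sel <- ppl[0, ]

# Visit each distinct group once
for (g in unique(ppl$grp)) {
  temp <- ppl[ppl$grp == g, ]
  # which.max() gives the position of the largest value; max() gives the value
  sel <- rbind(sel, temp[which.max(temp$into), ])
}

# Sorted unique sample values per group, as a named list ("0", "1", "2")
samples <- lapply(split(ppl$sample, ppl$grp), function(s) sort(unique(s)))

# Store them as a list column, matched to each selected row by group
sel$sample <- samples[as.character(sel$grp)]
```

Unlike the ave() comparison, which.max() returns only the first maximum, so ties within a group yield a single row.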