Vanbell - 2 months ago
R Question

Remove duplicates in R using two criteria within an interval

I am cleaning and processing data with R, and I would like to remove the duplicates from a matrix. See the example below.
I would like to remove duplicates according to two criteria, if possible using an interval: if a row's RT (± 0.1) and m.z (± 0.001) both match a row that already appears in the table, the extra row should be removed.

RT m.z
1 2.02 326.1988
2 2.03 326.1989
3 2.06 326.1990
4 2.03 331.1533
5 2.03 375.1785
6 2.03 301.2852
7 2.04 301.2852
8 2.06 301.2852
9 2.07 357.2609
10 2.07 308.0327
11 2.08 218.2221
12 2.08 312.3617
13 2.10 473.3453
14 2.15 388.3929


I would like output like this:

RT m.z
1 2.02 326.1988
2
3 2.06 326.1990
4 2.03 331.1533
5 2.03 375.1785
6 2.03 301.2852
7
8 2.06 301.2852
9 2.07 357.2609
10 2.07 308.0327
11 2.08 218.2221
12 2.08 312.3617
13 2.10 473.3453
14 2.15 388.3929


Any help would be much appreciated.

Thanks in advance.

Answer

This is a way to do it with dplyr. Not sure if it's the most efficient way.

df <- read.table(textConnection("RT     m.z
1       2.02 326.1988
2       2.03 326.1989
3       2.06 326.1990
4       2.03 331.1533
5       2.03 375.1785
6       2.03 301.2852
7       2.04 301.2852
8       2.06 301.2852
9       2.07 357.2609
10      2.07 308.0327
11      2.08 218.2221
12      2.08 312.3617
13      2.10 473.3453
14      2.15 388.3929"))

The data frame above holds the same data you provided.

library(dplyr)
# Compute the absolute difference in RT and m.z between each row and the
# row before it; we filter on these differences further down the chain
df %>% mutate(
  rtdiff = abs(lag(RT) - RT),
  mzdiff = abs(lag(m.z) - m.z)
)  %>%
  # The first row has no previous row, so its differences are NA; replace
  # them with large values so filter does not have to deal with NAs
  mutate(rtdiff = replace(rtdiff, is.na(rtdiff), 999),
         mzdiff = replace(mzdiff, is.na(mzdiff), 999)) %>%
  # Drop every row that is within both thresholds of the row before it
  # (note: these thresholds are tighter than the ± 0.1 and ± 0.001 in the
  # question; adjust them as needed)
  filter(!(rtdiff < 0.02 & mzdiff < 0.0002)) %>%
  # Keep only the original columns and drop the helper columns
  select(RT, m.z)

giving us:

    RT      m.z
1  2.02 326.1988
2  2.06 326.1990
3  2.03 331.1533
4  2.03 375.1785
5  2.03 301.2852
6  2.06 301.2852
7  2.07 357.2609
8  2.07 308.0327
9  2.08 218.2221
10 2.08 312.3617
11 2.10 473.3453
12 2.15 388.3929
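
Note that the dplyr chain above only compares each row with the row directly before it, so near-duplicates that are not adjacent would slip through. If that matters for your data, here is a minimal base R sketch that instead compares each row against every row already kept. The helper name `dedup_within` and its arguments `rt_tol` and `mz_tol` are made up for illustration; the defaults just mirror the thresholds used in the filter above. It loops over rows, so it is probably not the fastest option for large tables.

```r
# Same example data as in the question
peaks <- data.frame(
  RT  = c(2.02, 2.03, 2.06, 2.03, 2.03, 2.03, 2.04,
          2.06, 2.07, 2.07, 2.08, 2.08, 2.10, 2.15),
  m.z = c(326.1988, 326.1989, 326.1990, 331.1533, 375.1785,
          301.2852, 301.2852, 301.2852, 357.2609, 308.0327,
          218.2221, 312.3617, 473.3453, 388.3929)
)

# Hypothetical helper: keep a row only if no already-kept row is within
# BOTH tolerances (strict "<", as in the dplyr filter above)
dedup_within <- function(d, rt_tol = 0.02, mz_tol = 0.0002) {
  keep <- logical(nrow(d))            # starts as all FALSE
  for (i in seq_len(nrow(d))) {
    kept <- which(keep)               # indices of the rows kept so far
    near <- abs(d$RT[kept]  - d$RT[i])  < rt_tol &
            abs(d$m.z[kept] - d$m.z[i]) < mz_tol
    keep[i] <- !any(near)             # any() of an empty vector is FALSE
  }
  d[keep, , drop = FALSE]
}

res <- dedup_within(peaks)
res
```

On this example it drops the same two rows (2 and 7) as the dplyr version. Calling `dedup_within(peaks, rt_tol = 0.1, mz_tol = 0.001)` applies the tolerances stated in the question instead, which would also drop rows 3 and 8, so the thresholds need to be chosen to match the output you actually want.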