Vanbell - 1 month ago 4x
R Question

# Remove duplicates by two criteria within an interval in R

I am cleaning and processing data with R, and I would like to remove duplicates from a matrix (see the example below).
I would like to remove duplicates according to two criteria, ideally within an interval: if a row's RT (± 0.1) and m.z (± 0.001) match those of another row in the table, the extra row should be removed.

``````
        RT     m.z
1       2.02 326.1988
2       2.03 326.1989
3       2.06 326.1990
4       2.03 331.1533
5       2.03 375.1785
6       2.03 301.2852
7       2.04 301.2852
8       2.06 301.2852
9       2.07 357.2609
10      2.07 308.0327
11      2.08 218.2221
12      2.08 312.3617
13      2.10 473.3453
14      2.15 388.3929
``````

I would like output like this:

``````
        RT     m.z
1       2.02 326.1988
2
3       2.06 326.1990
4       2.03 331.1533
5       2.03 375.1785
6       2.03 301.2852
7
8       2.06 301.2852
9       2.07 357.2609
10      2.07 308.0327
11      2.08 218.2221
12      2.08 312.3617
13      2.10 473.3453
14      2.15 388.3929
``````

Any help would be much appreciated.

This is a way to do it with `dplyr`. Not sure if it's the most efficient way.

``````
df <- read.table(textConnection("RT     m.z
1       2.02 326.1988
2       2.03 326.1989
3       2.06 326.1990
4       2.03 331.1533
5       2.03 375.1785
6       2.03 301.2852
7       2.04 301.2852
8       2.06 301.2852
9       2.07 357.2609
10      2.07 308.0327
11      2.08 218.2221
12      2.08 312.3617
13      2.10 473.3453
14      2.15 388.3929"))
``````

Then, using the data you provided:

``````
library(dplyr)

# Compute the absolute difference in RT and m.z between each row and the
# previous one; these differences are filtered on further down the chain.
df %>%
  mutate(
    rtdiff = abs(lag(RT) - RT),
    mzdiff = abs(lag(m.z) - m.z)
  ) %>%
  # The first row has no predecessor, so its differences are NA. Replace them
  # with a large value so that filter() does not have to deal with NAs.
  mutate(
    rtdiff = replace(rtdiff, is.na(rtdiff), 999),
    mzdiff = replace(mzdiff, is.na(mzdiff), 999)
  ) %>%
  # Drop rows that are within both thresholds of the previous row.
  # Adjust 0.02 and 0.0002 to the tolerances you need.
  filter(!(rtdiff < 0.02 & mzdiff < 0.0002)) %>%
  # Keep only the original columns.
  select(RT, m.z)
``````

giving us:

``````
    RT      m.z
1  2.02 326.1988
2  2.06 326.1990
3  2.03 331.1533
4  2.03 375.1785
5  2.03 301.2852
6  2.06 301.2852
7  2.07 357.2609
8  2.07 308.0327
9  2.08 218.2221
10 2.08 312.3617
11 2.10 473.3453
12 2.15 388.3929
``````
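
Note that the `dplyr` chain only compares each row with the one directly before it, so near-duplicates that are not adjacent in the table would be missed. As a minimal sketch of one alternative, assuming you want to keep the first row of each group and drop any later row that lies within both tolerances of a row already kept, something like the following base R loop could work. The helper name `dedup_within` and its arguments `rt_tol` and `mz_tol` are hypothetical (they are not part of the original answer or of any package), and the loop is quadratic in the number of rows, so it is only meant for small tables.

```r
# Same data as in the question, rebuilt here so the example is self-contained.
df <- data.frame(
  RT  = c(2.02, 2.03, 2.06, 2.03, 2.03, 2.03, 2.04,
          2.06, 2.07, 2.07, 2.08, 2.08, 2.10, 2.15),
  m.z = c(326.1988, 326.1989, 326.1990, 331.1533, 375.1785,
          301.2852, 301.2852, 301.2852, 357.2609, 308.0327,
          218.2221, 312.3617, 473.3453, 388.3929)
)

# Hypothetical helper: greedily keep a row unless it is within BOTH
# tolerances (strictly less than) of a row that has already been kept.
dedup_within <- function(df, rt_tol, mz_tol) {
  keep <- logical(nrow(df))          # all FALSE to start with
  for (i in seq_len(nrow(df))) {
    kept <- which(keep)              # indices of rows kept so far
    is_dup <- any(abs(df$RT[kept]  - df$RT[i])  < rt_tol &
                  abs(df$m.z[kept] - df$m.z[i]) < mz_tol)
    keep[i] <- !is_dup               # any() on an empty vector is FALSE
  }
  df[keep, , drop = FALSE]
}

# Thresholds used in the answer: rows 2 and 7 are dropped.
res_narrow <- dedup_within(df, rt_tol = 0.02, mz_tol = 0.0002)
nrow(res_narrow)  # 12

# Tolerances stated in the question: rows 2, 3, 7 and 8 are dropped.
res_wide <- dedup_within(df, rt_tol = 0.1, mz_tol = 0.001)
nrow(res_wide)    # 10
```

With the thresholds from the answer, this reproduces the 12 rows shown above. With the ± 0.1 and ± 0.001 tolerances stated in the question, rows 3 and 8 are also treated as duplicates, so it is worth deciding which tolerances you actually need.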
Source (Stack Overflow)