Vanbell - 5 months ago 12

R Question

I have n matrices in a list and an additional matrix that contains the values I want to find in the list of matrices.

To get the list of matrices, I use this code:

```
setwd("C:\\~\\Documents\\R")

import.multiple.txt.files <- function(pattern = ".txt", header = T) {
  list.1 <- list.files(pattern = ".txt")
  list.2 <- list()
  for (i in 1:length(list.1)) {
    list.2[[i]] <- read.delim(list.1[i])
  }
  names(list.2) <- list.1
  list.2
}

txt.import.matrix <- cbind(txt.import)
```

My list looks like this (I show only an example with n = 2). The number of rows in each array is different (here I use 5 and 6 rows to keep it simple, but my real data have more than 500).

```
txt.import.matrix[1]
[[1]]
  X.  RT.   Area.     m.z.
1  1 1.01  2820.1 358.9777
2  2 1.03  9571.8 368.4238
3  3 2.03  6674.0 284.3294
4  4 2.03  5856.3 922.0094
5  5 3.03 27814.6 261.1299

txt.import.matrix[2]
[[2]]
  X.  RT.   Area.     m.z.
1  1 1.01  7820.1 358.9777
2  2 1.06  8271.8 368.4238
3  3 2.03 12674.0 284.3294
4  4 2.03  5856.6 922.0096
5  5 2.03 17814.6 261.1299
6  6 3.65  5546.5 528.6475
```

I have another array of values that I want to find in the list of matrices. This array was obtained by combining all the arrays from the list into one array and removing the duplicates.

```
reduced.list.pre.filtering
   RT.     m.z.
1 1.01 358.9777
2 1.07 368.4238
3 2.05 284.3295
4 2.03 922.0092
5 3.03 261.1299
6 3.56 869.4558
```

I would like to obtain a new matrix that lists, for each row of `reduced.list.pre.filtering`, the `Area.` from each matrix in the list where the row matches within `RT. ± 0.02` and `m.z. ± 0.0002`:

```
   RT.     m.z. Area.[1] Area.[2]
1 1.01 358.9777   2820.1   7820.1
2 1.07 368.4238            8271.8
3 2.05 284.3295   6674.0  12674.0
4 2.03 922.0092   5856.3
5 3.03 261.1299  27814.6
6 3.65 528.6475
```

I only know how to match one exact value in one array. The difficulty here is that the values must be found in a list of arrays, and that each value must be matched within ± an interval. If you have any suggestions, I will be very grateful.

Answer

This is an alternative approach to Arun's rather elegant answer using `data.table`. I decided to post it because it contains two additional aspects that are important considerations in your problem:

- **Floating point comparison:** a comparison to see whether a floating point value lies in an interval requires consideration of the round-off error in computing the interval. This is the general problem of comparing floating point representations of real numbers, and it applies in R as well. This comparison is implemented in the function `in.interval` below.
- **Multiple matches:** your interval match criterion can result in multiple matches if the intervals overlap. The code below **assumes** that you only want the first match (with respect to increasing rows of each `txt.import.matrix` matrix). This is implemented in the function `match.interval` and explained in the notes that follow. Other logic is needed if you want something like the average of the areas that match your criterion.
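To make the floating point concern concrete, here is a minimal sketch using two `RT.` values from the question's sample data. The variable names (`x`, `center`, `dev`, `naive`, `tolerant`) are chosen for illustration only; the tolerant form mirrors the expression used in `in.interval`.

```r
# Values taken from the question's sample data; names are illustrative.
x      <- 1.03
center <- 1.01
dev    <- 0.02

# Naive comparison: on IEEE 754 doubles, the stored values of 1.03 and 1.01
# differ by slightly more than the stored value of 0.02.
naive <- abs(x - center) <= dev

# Tolerant comparison, same form as in.interval in the answer.
tol      <- .Machine$double.eps^0.5   # roughly 1.5e-8
tolerant <- abs(x - center) <= (dev + tol)

print(c(naive = naive, tolerant = tolerant))
```

The tolerance is several orders of magnitude smaller than either interval half-width (0.02 and 0.0002), so it should only affect values that sit on the boundary.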

To find the matching row(s) in a matrix from `txt.import.matrix` for each row in the matrix `reduced.list.pre.filtering`, the code in `match.interval` vectorizes the application of the comparison function over the space of all enumerated pairs of rows between `reduced.list.pre.filtering` and the matrix from `txt.import.matrix`. Functionally, for this application, this is the same as Arun's solution using `data.table`'s non-equi joins; however, the non-equi join feature is more general, and the `data.table` implementation is most likely better optimized for both memory usage and speed, even for this application.
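As a rough illustration of what comparing all pairs of rows at once means, the sketch below uses base R's `outer` rather than the `rep`-based expansion in the answer, so it is a swapped-in technique and not the author's code. The data frames `r` and `t1` are copied from the question's sample (only the columns used here), and the helper name `near` is hypothetical.

```r
# Sample data copied from the question.
r  <- data.frame(RT.  = c(1.01, 1.07, 2.05, 2.03, 3.03, 3.56),
                 m.z. = c(358.9777, 368.4238, 284.3295, 922.0092,
                          261.1299, 869.4558))
t1 <- data.frame(RT.   = c(1.01, 1.03, 2.03, 2.03, 3.03),
                 Area. = c(2820.1, 9571.8, 6674.0, 5856.3, 27814.6),
                 m.z.  = c(358.9777, 368.4238, 284.3294, 922.0094,
                           261.1299))

# Hypothetical helper: tolerant interval test, same idea as in.interval.
tol  <- .Machine$double.eps^0.5
near <- function(a, b, dev) abs(a - b) <= dev + tol

# M x N logical matrix: rows index t1, columns index r.
hit <- outer(t1$RT.,  r$RT.,  near, dev = 0.02) &
       outer(t1$m.z., r$m.z., near, dev = 0.0002)

# First matching row of t1 for each row of r (NA when there is none).
first <- apply(hit, 2, function(col) which(col)[1])
area  <- t1$Area.[first]
print(area)
```

For this sample, `area` should agree with the `Area.[1]` column of the output printed by the answer's code.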

```
in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  return (abs(x-center) <= (deviation + tol))
}

match.interval <- function(r, t) {
  r.rt <- rep(r[,1], each=nrow(t))
  t.rt <- rep(t[,2], times=nrow(r))
  r.mz <- rep(r[,2], each=nrow(t))
  t.mz <- rep(t[,4], times=nrow(r))                       ## 1.
  ind <- which(in.interval(r.rt, t.rt, 0.02) &
               in.interval(r.mz, t.mz, 0.0002))
  r.ind <- floor((ind - 1)/nrow(t)) + 1                   ## 2.
  dup <- duplicated(r.ind)
  r.ind <- r.ind[!dup]
  t.ind <- ind[!dup] - (r.ind - 1)*nrow(t)                ## 3.
  return(cbind(r.ind,t.ind))
}

get.area.matched <- function(r, t) {
  match.ind <- match.interval(r, t)
  area <- rep(NA,nrow(r))
  area[match.ind[,1]] <- t[match.ind[,2], 3]              ## 4.
  return(area)
}

res <- cbind(reduced.list.pre.filtering,
             do.call(cbind,lapply(txt.import.matrix,
                                  get.area.matched,
                                  r=reduced.list.pre.filtering))) ## 5.
colnames(res) <- c(colnames(reduced.list.pre.filtering),
                   sapply(seq_len(length(txt.import.matrix)),
                          function(i) {return(paste0("Area.[",i,"]"))})) ## 6.
print(res)
##      RT.     m.z. Area.[1] Area.[2]
##[1,] 1.01 358.9777   2820.1   7820.1
##[2,] 1.07 368.4238       NA   8271.8
##[3,] 2.05 284.3295   6674.0  12674.0
##[4,] 2.03 922.0092   5856.3       NA
##[5,] 3.03 261.1299  27814.6       NA
##[6,] 3.56 869.4558       NA       NA
```
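The code above keeps only the first match for each row. If something like the average over all matching areas is wanted instead, one possible sketch is shown here. It uses base R's `outer` and `apply` rather than the answer's index arithmetic, and the data, the helper `near`, and the name `mean.area` are all hypothetical, chosen so that one row has two matches.

```r
# Hypothetical helper: tolerant interval test, same idea as in.interval.
tol  <- .Machine$double.eps^0.5
near <- function(a, b, dev) abs(a - b) <= dev + tol

# Hypothetical data: both rows of t fall inside the intervals of the
# single row of r.
r <- data.frame(RT. = 2.03, m.z. = 284.3294)
t <- data.frame(RT.   = c(2.03, 2.04),
                Area. = c(100, 300),
                m.z.  = c(284.3294, 284.3295))

# Logical matrix with one row per row of t and one column per row of r.
hit <- outer(t$RT.,  r$RT.,  near, dev = 0.02) &
       outer(t$m.z., r$m.z., near, dev = 0.0002)

# Mean area over all matches for each row of r; NA when nothing matches.
mean.area <- apply(hit, 2, function(col) {
  if (any(col)) mean(t$Area.[col]) else NA_real_
})
print(mean.area)
```

Whether a mean, a sum, or the closest match is appropriate depends on the data, so this is only one of several reasonable choices.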

Notes:

1. This part constructs the data that enables vectorizing the comparison function over all enumerated pairs of rows between `reduced.list.pre.filtering` and the matrix from `txt.import.matrix`. The data are four arrays: the two columns of `reduced.list.pre.filtering` used in the comparison criterion, replicated in the row dimension of each matrix from `txt.import.matrix`, and the two columns of each matrix from `txt.import.matrix` used in the comparison criterion, replicated in the row dimension of `reduced.list.pre.filtering`. Here, the term array refers to either a 2-D matrix or a 1-D vector. The resulting four arrays are:

   - `r.rt` is the replication of the `RT.` column of `reduced.list.pre.filtering` (i.e., `r[,1]`) in the row dimension of `t`
   - `t.rt` is the replication of the `RT.` column of the matrix from `txt.import.matrix` (i.e., `t[,2]`) in the row dimension of `r`
   - `r.mz` is the replication of the `m.z.` column of `reduced.list.pre.filtering` (i.e., `r[,2]`) in the row dimension of `t`
   - `t.mz` is the replication of the `m.z.` column of the matrix from `txt.import.matrix` (i.e., `t[,4]`) in the row dimension of `r`

   What is important is that the indices of each of these arrays enumerate all pairs of rows in `r` and `t` in the same manner. Specifically, viewing these arrays as 2-D matrices of size `M` by `N`, where `M=nrow(t)` and `N=nrow(r)`, the row indices correspond to the rows of `t` and the column indices correspond to the rows of `r`. Consequently, the values at the `i`-th row and the `j`-th column of the four arrays are the values used in the comparison criterion between the `j`-th row of `r` and the `i`-th row of `t`.

   The replication uses the R function `rep`. For example, in computing `r.rt`, `rep` with `each=M` is used, which has the effect of treating its input `r[,1]` as a row vector and replicating that row `M` times to form `M` rows. As a result, each column of `r.rt`, which corresponds to a row in `r`, holds the `RT.` value from that row of `r`, and that value is the same in all rows of that column, each of which corresponds to a row in `t`. This means that in comparing that row in `r` to any row in `t`, the value of `RT.` from that row in `r` is used. Conversely, in computing `t.rt`, `rep` with `times=N` is used, which has the effect of treating its input as a column vector and replicating that column `N` times to form `N` columns. As a result, each row of `t.rt`, which corresponds to a row in `t`, holds the `RT.` value from that row of `t`, and that value is the same in all columns of that row, each of which corresponds to a row in `r`. This means that in comparing that row in `t` to any row in `r`, the value of `RT.` from that row in `t` is used. Similarly, the computations of `r.mz` and `t.mz` follow using the `m.z.` column from `r` and `t`, respectively.

2. This performs the vectorized comparison, resulting in an `M` by `N` logical matrix whose element in the `i`-th row and the `j`-th column is `TRUE` if the `j`-th row of `r` matches the criterion with the `i`-th row of `t`, and `FALSE` otherwise. The output of `which()` is the vector of linear (column-major) indices of the `TRUE` elements of this comparison result matrix. We want to convert these linear indices to row and column indices of the comparison result matrix in order to refer back to the rows of `r` and `t`. The next line extracts the column indices from the linear indices. Note that the variable name is `r.ind` to denote that these correspond to the rows of `r`. We extract this first because it is needed for detecting multiple matches for a row in `r`.

3. This part handles possible multiple matches in `t` for a given row in `r`. Multiple matches show up as duplicate values in `r.ind`. As stated above, the logic here keeps only the first match in terms of increasing rows in `t`. The function `duplicated` returns a logical vector that is `TRUE` for every element that repeats an earlier value in the array, so removing the flagged elements does what we want. The code first removes them from `r.ind`, then removes them from `ind`, and finally computes the row indices of the comparison result matrix, which correspond to the rows of `t`, using the pruned `ind` and `r.ind`. What `match.interval` returns is a matrix whose rows are matched pairs of row indices, with its first column holding row indices into `r` and its second column holding row indices into `t`.

4. The `get.area.matched` function simply uses the result of `match.interval` (stored in `match.ind`) to extract the `Area.` from `t` for all matches. Note that the returned result is a (column) vector whose length equals the number of rows in `r` and which is initialized to `NA`. In this way, rows in `r` that have no match in `t` get a returned `Area.` of `NA`.

5. This uses `lapply` to apply the function `get.area.matched` over the list `txt.import.matrix` and appends the returned matched `Area.` results to `reduced.list.pre.filtering` as column vectors.

6. Similarly, the appropriate column names are appended and set in the result `res`.
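The index bookkeeping described in notes 1 to 3 can be traced on a toy case. This is a minimal sketch with hypothetical sizes and values (`M = 3`, `N = 2`, and made-up match positions), not data from the question; it reuses the same expressions as `match.interval`.

```r
M <- 3                     # hypothetical nrow(t)
N <- 2                     # hypothetical nrow(r)
r.val <- c(10, 20)         # stands in for one column of r
t.val <- c(1, 2, 3)        # stands in for one column of t

r.rep <- rep(r.val, each  = M)   # 10 10 10 20 20 20
t.rep <- rep(t.val, times = N)   #  1  2  3  1  2  3

# Viewed as M x N matrices (column-major): rows follow t, columns follow r.
print(matrix(r.rep, nrow = M))
print(matrix(t.rep, nrow = M))

# Suppose the comparison is TRUE at linear positions 2, 3 and 4, i.e. the
# first row of r matches rows 2 and 3 of t, and the second row of r
# matches row 1 of t.
ind   <- c(2, 3, 4)
r.ind <- floor((ind - 1) / M) + 1       # column index: 1 1 2
dup   <- duplicated(r.ind)              # FALSE TRUE FALSE
r.ind <- r.ind[!dup]                    # keep first match per row of r: 1 2
t.ind <- ind[!dup] - (r.ind - 1) * M    # row index into t: 2 1
print(cbind(r.ind, t.ind))
```

The second match for the first row of `r` (row 3 of `t`) is discarded, which is the first-match behavior the answer assumes.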

**Edit:** Alternative implementation using the `foreach` package

In hindsight, a better implementation uses the `foreach` package for vectorizing the comparison. This implementation requires the `foreach` and `magrittr` packages:

```
require("magrittr") ## for %>%
require("foreach")
```

Then the code in `match.interval` for vectorizing the comparison

```
r.rt <- rep(r[,1], each=nrow(t))
t.rt <- rep(t[,2], times=nrow(r))
r.mz <- rep(r[,2], each=nrow(t))
t.mz <- rep(t[,4], times=nrow(r))                       # 1.
ind <- which(in.interval(r.rt, t.rt, 0.02) &
             in.interval(r.mz, t.mz, 0.0002))
```

can be replaced by

```
ind <- foreach(r.row = 1:nrow(r), .combine=cbind) %:%
         foreach(t.row = 1:nrow(t)) %do%
           match.criterion(r.row, t.row, r, t) %>%
           as.logical(.) %>% which(.)
```

where `match.criterion` is defined as

```
match.criterion <- function(r.row, t.row, r, t) {
  return(in.interval(r[r.row,1], t[t.row,2], 0.02) &
         in.interval(r[r.row,2], t[t.row,4], 0.0002))
}
```

This is easier to parse and reflects what is being performed. Note that what is returned by the nested `foreach` combined with `cbind` is again a logical matrix. Finally, the application of the `get.area.matched` function over the list `txt.import.matrix` can also be performed using `foreach`:

```
res <- foreach(i = 1:length(txt.import.matrix), .combine=cbind) %do%
         get.area.matched(reduced.list.pre.filtering, txt.import.matrix[[i]]) %>%
         cbind(reduced.list.pre.filtering,.)
```

The complete code using `foreach` is as follows:

```
require("magrittr")   ## for %>%
require("foreach")

in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  return (abs(x-center) <= (deviation + tol))
}

match.criterion <- function(r.row, t.row, r, t) {
  return(in.interval(r[r.row,1], t[t.row,2], 0.02) &
         in.interval(r[r.row,2], t[t.row,4], 0.0002))
}

match.interval <- function(r, t) {
  ind <- foreach(r.row = 1:nrow(r), .combine=cbind) %:%
           foreach(t.row = 1:nrow(t)) %do%
             match.criterion(r.row, t.row, r, t) %>%
             as.logical(.) %>% which(.)
  # which returns 1-D (linear) indices in column-major order;
  # convert these to 2-D indices in (row,col)
  r.ind <- floor((ind - 1)/nrow(t)) + 1                   ## 2.
  # detect duplicates in r.ind and remove them from ind
  dup <- duplicated(r.ind)
  r.ind <- r.ind[!dup]
  t.ind <- ind[!dup] - (r.ind - 1)*nrow(t)                ## 3.
  return(cbind(r.ind,t.ind))
}

get.area.matched <- function(r, t) {
  match.ind <- match.interval(r, t)
  area <- rep(NA,nrow(r))
  area[match.ind[,1]] <- t[match.ind[,2], 3]
  return(area)
}

res <- foreach(i = 1:length(txt.import.matrix), .combine=cbind) %do%
         get.area.matched(reduced.list.pre.filtering, txt.import.matrix[[i]]) %>%
         cbind(reduced.list.pre.filtering,.)
colnames(res) <- c(colnames(reduced.list.pre.filtering),
                   sapply(seq_len(length(txt.import.matrix)),
                          function(i) {return(paste0("Area.[",i,"]"))}))
```

Hope this helps.