gugy gugy - 3 months ago 14
R Question

R - make it faster: check matrix positions for characters and put info into list (0/1)

I have a code snippet that does what it is supposed to do, but it is very slow, probably because of the nested for loops.
Because I run it on huge files, it slows down my script considerably.

I am guessing R has a built-in function that easily does what I am doing in the for loops?

Does anyone have an idea how to make it faster?

What the code below does:

It checks whether each position in the matrix holds an alphabet character (1) or some other character (0). This information is then saved in a list.

Basically, what I need to continue with is a true/false value for every matrix position, marking the alphabet characters.
I then use the true/false list for "renumbering the matrix elements" (so that the non-alphabet characters are not counted).

UPDATE:



What I mean by "renumbering the matrix elements":
Protein sequences are always numbered, so a protein of length 560 has 560 amino acids in its sequence. If you make an alignment of sequences and their lengths are not identical (A: 560 amino acids, B: 600 amino acids), the alignment will introduce gaps where the sequences do not match. My matrix is an alignment and therefore has gaps (non-alphabet characters, usually "-"). To be able to address position 100 of sequence A in the alignment, I need to renumber the alignment so that only non-gap positions get a number, and then address that position. Otherwise, if I print position 100 of the alignment, it will not be position 100 of my sequence A.
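
As a small illustration of the renumbering idea, consider a single gapped sequence. This is only a sketch: the names `aln_seq`, `is_residue`, and `aln_col` are hypothetical and exist only for this example.

```r
# One aligned sequence with gaps (made-up example data)
aln_seq <- c("A", "-", "B", "-", "C")

# TRUE where the alignment column holds an amino acid letter
is_residue <- grepl("[A-Za-z]", aln_seq)

# Alignment columns of the non-gap positions, in sequence order
aln_col <- which(is_residue)

# Sequence position 2 (the "B") sits in alignment column 3
aln_col[2]
```

So position 2 of the sequence is found at column 3 of the alignment, not at column 2.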

FYI:
This is for protein sequence alignments, and I want all the amino acids (alphabet characters) to be numbered, but not the gaps (other characters like "-" or "."). This later enables me to address specifically the positions where amino acids are, and to analyse my huge alignments more easily.

If clarifications are needed please comment!

MSAmatrix<-matrix(c("A","-","B", "-", "C","A","D","B", "-", "C","A","-","B", "F", "C","A","D",".", "-", "C"), nrow=4, byrow=TRUE)

letters<-list()
lettersrenumbered<-list()
referencesequence<-1
# for whatever reason I am initialising the lists wrong and they need to be filled with 1 element before I can use them in the next loops...
for(i in 1:dim(MSAmatrix)[1]) {
  letters[[i]]<-1313
  lettersrenumbered[[i]]<-1313
}
# get info if position is an alphabet character or not
for(i in 1:dim(MSAmatrix)[1]) {
  for(j in 1:dim(MSAmatrix)[2]) {
    if(grepl("[a-zA-Z]",MSAmatrix[i,])[j]){
      letters[[i]][j]<-1
    }
    else{
      letters[[i]][j]<-0
    }
  }
}

#renumber all the sequences so that only the alphabet characters get a number
for(i in 1:dim(MSAmatrix)[1]) {
  count<-0
  for(j in 1:dim(MSAmatrix)[2]) {
    if(letters[[i]][j]==1){
      count<-count+1
      lettersrenumbered[[i]][j]<-count
    }
    else{
      lettersrenumbered[[i]][j]<-" "
    }
  }
}

Answer

On my machine the following is around 20 times faster than your method:

Create a matrix of the same dimensions as MSAmatrix, but all FALSE:

X <- matrix(FALSE, nrow = nrow(MSAmatrix), ncol = ncol(MSAmatrix))

Wherever MSAmatrix holds a capital letter, mark it as TRUE:

X[MSAmatrix %in% LETTERS] <- TRUE

You can eke out a bit more speed (30%) by creating the matrix directly, though it may be a little harder to assure yourself that it's correct. Note that byrow = FALSE is needed here, because %in% returns a plain vector in column-major order. That is, just:

matrix(MSAmatrix %in% LETTERS, nrow = nrow(MSAmatrix), byrow = FALSE)
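
LETTERS only covers capital letters. If lowercase letters should count too, as in the `[a-zA-Z]` pattern from the question, a vectorised `grepl` over the whole matrix is one option. This is a sketch with no timing claimed, and `is_letter` is just an illustrative name. It also avoids the built-in `letters` constant, which the question's code overwrites with a list.

```r
MSAmatrix <- matrix(c("A", "-", "B", "-", "C",
                      "A", "D", "B", "-", "C",
                      "A", "-", "B", "F", "C",
                      "A", "D", ".", "-", "C"), nrow = 4, byrow = TRUE)

# grepl() tests every element at once and returns a plain logical vector
# in column-major order, so the matrix shape has to be restored afterwards
is_letter <- grepl("[A-Za-z]", MSAmatrix)
dim(is_letter) <- dim(MSAmatrix)

is_letter
```

For the example data, which contains only capital letters, this gives the same matrix as the `%in% LETTERS` version.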

It's currently unclear what you mean by "renumbering the matrix elements", but if you use apply and cumsum to take a running count down each column:

apply(X, 2, cumsum)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    1    0    1
[2,]    2    1    2    0    2
[3,]    3    1    3    1    3
[4,]    4    2    3    1    4

I think you get close to what you intend.
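
Going by the update in the question (each sequence is a row, and gaps should not get a number), a running count along each row may be closer to the intent than the column-wise count above. A minimal sketch, assuming X is the logical matrix from before; `renumbered` and `ref` are illustrative names, not anything from the original code:

```r
MSAmatrix <- matrix(c("A", "-", "B", "-", "C",
                      "A", "D", "B", "-", "C",
                      "A", "-", "B", "F", "C",
                      "A", "D", ".", "-", "C"), nrow = 4, byrow = TRUE)
X <- matrix(MSAmatrix %in% LETTERS, nrow = nrow(MSAmatrix))

# Running count of letters along each row; apply() over rows returns the
# per-row results as columns, hence the transpose
renumbered <- t(apply(X, 1, cumsum))

# Gap positions get NA instead of a number
renumbered[!X] <- NA

# Alignment column that holds position 3 of the reference sequence (row 1)
ref <- 1
which(renumbered[ref, ] == 3)
```

For the example data, the first row becomes 1 NA 2 NA 3, so position 3 of sequence 1 is found in alignment column 5.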