Mohamed Ezzeddine Macherki Mohamed Ezzeddine Macherki - 2 months ago 7
R Question

number of occurence of two letters separted by a finit gap

Lets we have a string

Seq
of size
N
and contain
z
alphabet { A,B,C….}.
I have to perform a function that calculates number of occurrence of letter
i
which is separated with
l
letters (gap) form the letter
j
:
Example:

I=A, j=T and gap=10
That have to find the number of occurrence of AT,A-T,A--T,A---T,A----T,,,A---------T(- is any alphabet)


I used this code for DNA sequence (alphabet
z
=4) and suppose that is N(L*N)
complexity:

inline const short int V (const char x){
switch(x){
case 'A':case 'a':
return 0;
break;
case 'C':case 'c':
return 1;
break;
case 'G':case 'g':
return 2;
break;
default:
return 3;
break;
}

}

// [[Rcpp::export]]
IntegerVector basem(std::string seq ,int gap ,int N){
IntegerVector res(16);
short int x=0;
short int y=0;
for(int i=0;i<N-gap;i++){
x=V(seq[i]);
y=V(seq[i+gap]);
res[4*x+y]++;
}
return res;
}

// [[Rcpp::export]]
List basegap (std::string sequence,int x){
int n=sequence.size();
List result ;
for(int j=1;j<x;j++) {
result.push_back(basem(sequence,j,n));
}
return result;
}



basem:calculate the number of occurence for a gap

basegap: calculate the number of occurence for a gap from 1 to X


Running Example:

a
[1] "atgccatcactcagtaaagaagcggccctggttcatgaagcgttagttgcgcgaggactggaaacaccgctgcgcccgcccgtgcatgaaatggataacgaaacgccgaaaagccttattgctggtcatatgaccgaaatcatgcagctgctgaatctcgacctggctgatgacagtttgatggaaacgcggcatcgcatcgctaaaatgtatgtcgatgaaattttctccggtctggattacgccaatttcccgaaaatcaccctcattgaaaacaaaatgaaggtcgatgaaatggtcaccgtccgcgatatcactctgaccagcacctgtgaacaccattttgttaccatcgatggcaaagcgacggtggcctatatcccgaaagattcggtgatcggtctgtcaaaaattaaccgcattgtgcagttctttgcccagcgtccgcaggtgcaggaacgtctgacgcagcaaattcttgttgccgcgctacaaacgctgctgggcaccaataacgtggctgtctcgatcgacgcggtgcattactgcgtgaaggcgcgtggcatccgcgatgcaaccagtgccacgacaacgtcctctcttggtggattgttcaaatccagtcagaatacgcgccacgagtttctgcgcgctgtgcgtcatcacaactga"

basem(a,5,nchar(a))
[1] 40 50 42 37 47 46 36 49 45 39 44 39 38 42 45 28
x<-basegap(a,10)
[[1]]
[1] 61 38 23 48 48 39 57 35 44 58 28 38 17 44 60 33

[[2]]
[1] 42 47 43 38 42 49 53 35 44 44 35 44 42 39 37 36

[[3]]
[1] 50 39 44 37 35 53 44 47 46 41 47 33 39 46 32 36

[[4]]
[1] 42 54 36 38 54 36 52 36 43 49 41 34 31 39 38 45

[[5]]
[1] 40 50 42 37 47 46 36 49 45 39 44 39 38 42 45 28

[[6]]
[1] 42 44 40 42 45 44 44 45 38 44 47 38 44 45 36 28

[[7]]
[1] 38 44 44 42 44 49 42 42 40 44 42 41 47 40 39 27

[[8]]
[1] 34 48 47 38 46 49 41 41 47 39 39 42 42 40 40 31

[[9]]
[1] 43 37 45 42 40 46 56 34 45 57 33 32 40 36 33 44

m<-matrix(unlist(x),nrow=16)
m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 61 42 50 42 40 42 38 34 43
[2,] 38 47 39 54 50 44 44 48 37
[3,] 23 43 44 36 42 40 44 47 45
[4,] 48 38 37 38 37 42 42 38 42
[5,] 48 42 35 54 47 45 44 46 40
[6,] 39 49 53 36 46 44 49 49 46
[7,] 57 53 44 52 36 44 42 41 56
[8,] 35 35 47 36 49 45 42 41 34
[9,] 44 44 46 43 45 38 40 47 45
[10,] 58 44 41 49 39 44 44 39 57
[11,] 28 35 47 41 44 47 42 39 33
[12,] 38 44 33 34 39 38 41 42 32
[13,] 17 42 39 31 38 44 47 42 40
[14,] 44 39 46 39 42 45 40 40 36
[15,] 60 37 32 38 45 36 39 40 33
[16,] 33 36 36 45 28 28 27 31 44


My questions are:


  1. How to reduce the complexity for this calculation?

  2. Are there any ideas to improve this code?


Answer

This can be done in O(N log N). The algorithm is as follows:

  1. Store the indexes of each letter i in a sorted manner. O(N)

  2. For every index with the letter j, binary search for the number of indexes of the letter i in the range. O(N log N)

In C++, this can be done by subtracting iterators. For example, in 1 iteration:

vector<int> indices;   <- Store indices of letter i
int index;   <- Index of a letter j
vector<int>:iterator it1 = upper_bound(indices.begin(),indices.end(),index+L);  <- Position just above the last index 
vector<int>:iterator it2 = lower_bound(indices.begin(),indices.end(),index-L);  <- Position at or below the first index
int num = it1 - it2;  <- number of indices with the letter i in the range

For other languages, there might be other methods of doing this, but since I am not very well versed in terms of different programming languages, I can only give a C++ example. Basically you want to subtract the larger index by the smaller index.

You might want to look up some other applications of binary search as well.