Mohamed Ezzeddine Macherki - 1 month ago 3x
R Question

# number of occurence of two letters separted by a finit gap

Lets we have a string

`Seq`
of size
`N`
and contain
`z`
alphabet { A,B,C….}.
I have to perform a function that calculates number of occurrence of letter
`i`
which is separated with
`l`
letters (gap) form the letter
`j`
:
Example:

``````I=A, j=T and gap=10
That have to find the number of occurrence of AT,A-T,A--T,A---T,A----T,,,A---------T(- is any alphabet)
``````

I used this code for DNA sequence (alphabet
`z`
=4) and suppose that is N(L*N)
complexity:

``````inline const short int V (const char x){
switch(x){
case 'A':case 'a':
return 0;
break;
case 'C':case 'c':
return 1;
break;
case 'G':case 'g':
return 2;
break;
default:
return 3;
break;
}

}

// [[Rcpp::export]]
IntegerVector basem(std::string seq ,int gap ,int N){
IntegerVector res(16);
short int x=0;
short int y=0;
for(int i=0;i<N-gap;i++){
x=V(seq[i]);
y=V(seq[i+gap]);
res[4*x+y]++;
}
return res;
}

// [[Rcpp::export]]
List basegap (std::string sequence,int x){
int n=sequence.size();
List result ;
for(int j=1;j<x;j++) {
result.push_back(basem(sequence,j,n));
}
return result;
}
``````

basem:calculate the number of occurence for a gap

basegap: calculate the number of occurence for a gap from 1 to X

Running Example:

``````a
[1] "atgccatcactcagtaaagaagcggccctggttcatgaagcgttagttgcgcgaggactggaaacaccgctgcgcccgcccgtgcatgaaatggataacgaaacgccgaaaagccttattgctggtcatatgaccgaaatcatgcagctgctgaatctcgacctggctgatgacagtttgatggaaacgcggcatcgcatcgctaaaatgtatgtcgatgaaattttctccggtctggattacgccaatttcccgaaaatcaccctcattgaaaacaaaatgaaggtcgatgaaatggtcaccgtccgcgatatcactctgaccagcacctgtgaacaccattttgttaccatcgatggcaaagcgacggtggcctatatcccgaaagattcggtgatcggtctgtcaaaaattaaccgcattgtgcagttctttgcccagcgtccgcaggtgcaggaacgtctgacgcagcaaattcttgttgccgcgctacaaacgctgctgggcaccaataacgtggctgtctcgatcgacgcggtgcattactgcgtgaaggcgcgtggcatccgcgatgcaaccagtgccacgacaacgtcctctcttggtggattgttcaaatccagtcagaatacgcgccacgagtttctgcgcgctgtgcgtcatcacaactga"

basem(a,5,nchar(a))
[1] 40 50 42 37 47 46 36 49 45 39 44 39 38 42 45 28
x<-basegap(a,10)
[[1]]
[1] 61 38 23 48 48 39 57 35 44 58 28 38 17 44 60 33

[[2]]
[1] 42 47 43 38 42 49 53 35 44 44 35 44 42 39 37 36

[[3]]
[1] 50 39 44 37 35 53 44 47 46 41 47 33 39 46 32 36

[[4]]
[1] 42 54 36 38 54 36 52 36 43 49 41 34 31 39 38 45

[[5]]
[1] 40 50 42 37 47 46 36 49 45 39 44 39 38 42 45 28

[[6]]
[1] 42 44 40 42 45 44 44 45 38 44 47 38 44 45 36 28

[[7]]
[1] 38 44 44 42 44 49 42 42 40 44 42 41 47 40 39 27

[[8]]
[1] 34 48 47 38 46 49 41 41 47 39 39 42 42 40 40 31

[[9]]
[1] 43 37 45 42 40 46 56 34 45 57 33 32 40 36 33 44

m<-matrix(unlist(x),nrow=16)
m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]   61   42   50   42   40   42   38   34   43
[2,]   38   47   39   54   50   44   44   48   37
[3,]   23   43   44   36   42   40   44   47   45
[4,]   48   38   37   38   37   42   42   38   42
[5,]   48   42   35   54   47   45   44   46   40
[6,]   39   49   53   36   46   44   49   49   46
[7,]   57   53   44   52   36   44   42   41   56
[8,]   35   35   47   36   49   45   42   41   34
[9,]   44   44   46   43   45   38   40   47   45
[10,]   58   44   41   49   39   44   44   39   57
[11,]   28   35   47   41   44   47   42   39   33
[12,]   38   44   33   34   39   38   41   42   32
[13,]   17   42   39   31   38   44   47   42   40
[14,]   44   39   46   39   42   45   40   40   36
[15,]   60   37   32   38   45   36   39   40   33
[16,]   33   36   36   45   28   28   27   31   44
``````

My questions are:

1. How to reduce the complexity for this calculation?

2. Are there any ideas to improve this code?

This can be done in `O(N log N).` The algorithm is as follows:

1. Store the indexes of each letter `i` in a sorted manner. `O(N)`

2. For every index with the letter `j`, binary search for the number of indexes of the letter `i` in the range. `O(N log N)`

In C++, this can be done by subtracting iterators. For example, in 1 iteration:

``````vector<int> indices;   <- Store indices of letter i
int index;   <- Index of a letter j
vector<int>:iterator it1 = upper_bound(indices.begin(),indices.end(),index+L);  <- Position just above the last index
vector<int>:iterator it2 = lower_bound(indices.begin(),indices.end(),index-L);  <- Position at or below the first index
int num = it1 - it2;  <- number of indices with the letter i in the range
``````

For other languages, there might be other methods of doing this, but since I am not very well versed in terms of different programming languages, I can only give a C++ example. Basically you want to subtract the larger index by the smaller index.

You might want to look up some other applications of binary search as well.