Writer316 Writer316 - 1 month ago 8
Ruby Question

How to write a method that counts the most common substring in a string in ruby?

I have this program with a class DNA. The program counts the most frequent k-mer in a string. So, it is looking for the most common substring in a string with a length of k.

An example would be creating a dna1 object with a string of AACCAATCCG. The count k-mer method will look for a subtring with a length of k and output the most common answer. So, if we set k = 1 then 'A' and 'C' will be the most occurrence in the string because it appears four times. See example below:

dna1 = DNA.new('AACCAATCCG')
=> AACCAATCCG
>> dna1.count_kmer(1)
=> [#<Set: {"A", "C"}>, 4]
>> dna1.count_kmer(2)
=> [#<Set: {"AA", "CC"}>, 2]


Here is my DNA class :

class DNA
def initialize (nucleotide)
@nucleotide = nucleotide
end
def length
@nucleotide.length
end
protected
attr_reader :nucleotide
end


Here is my count kmer method that I am trying to implement:

# I have k as my only parameter because I want to pass the nucleotide string in the method
def count_kmer(k)

# I created an array as it seems like a good way to split up the nucleotide string.
counts = []

#this tries to count how many kmers of length k there are
num_kmers = self.nucleotide.length- k + 1

#this should try and look over the kmer start positions
for i in num_kmers

#Slice the string, so that way we can get the kmer
kmer = self.nucleotide.split('')
end

#add kmer if its not present
if !kmer = counts
counts[kmer] = 0

#increment the count for kmer
counts[kmer] +=1
end

#return the final count
return counts
end

#end dna class
end


I'm not sure where my method went wrong.

Answer

Code

def most_frequent_substrings(str, k)
  (0..str.size-k).each_with_object({}) do |i,h|
    b = [] 
    str[i..-1].scan(Regexp.new str[i,k]) { b << Regexp.last_match.begin(0) + i }
    (h[b.size] ||= []) << b
  end.max_by { |k,_| k }.last.each_with_object({}) { |a,h| h[str[a.first,k]] = a } 
end

Example

str = "ABBABABBABCATSABBABB"
most_frequent_substrings(str, 4)
  #=> {"ABBA"=>[0, 5, 14], "BBAB"=>[1, 6, 15]}

This shows that the most frequently-occurring 4-character substring of strappears 3 times. There are two such substrings: "ABBA" and "BBAB". "ABBA" begins at offsets (into str) 0, 5 and 14, "BBAB" substrings begin at offsets 1, 6 and 15.

Explanation

For the example above the steps are as follows.

k = 4
n = str.size - k
  #=> 20 - 4 => 16
e = (0..n).each_with_object([])
  #<Enumerator: 0..16:each_with_object([])> 

We can see the values that will be generated by this enumerator this by converting e to an array.

e.to_a
  #=> [[0, []], [1, []], [2, []], [3, []], [4, []], [5, []], [6, []], [7, []], [8, []],
 #     [9, []], [10, []], [11, []], [12, []], [13, []], [14, []], [15, []], [16, []]]

Note the empty array contained in each element will be modified as the array is built. Continuing, the first element of e is passed to the block and the block variables are assigned using parallel assignment:

i,a = e.next
  #=> [0, []] 
i #=> 0 
a #=> [] 

We are now considering the substring of size 4 that begins at str offset i #=> 0, which is seen to be "ABBA". Now the block calculation is performed.

b = []
r = Regexp.new str[i,k]
  #=> Regexp.new str[0,4]
  #=> Regexp.new "ABBA"
  #=> /ABAB/
str[i..-1].scan(r) { b << Regexp.last_match.begin(0) + i }
  #=> "ABBABABBABCATSABBABB".scan(r) { b << Regexp.last_match.begin(0) + i } 
b #=> [0, 5, 14]

We next have

(h[b.size] ||= []) << b

which becomes

(h[b.size] = h[b.size] ||= []) << b
  #=> (h[3] = h[3] || []) <<  [0, 5, 14]

Since h has no key 3, h[3] on the right side equals nil. Continuing,

  #=> (h[3] = nil || []) <<  [0, 5, 14]
  #=> (h[3] = []) <<  [0, 5, 14]
h #=> { 3=>[[0, 5, 14]] }

Notice that we throw away scan's return value. All we need is b

This tells us the "ABBA" appears thrice in str, beginning at offsets 0, 5 and 14.

Now observe

e.to_a
  #=> [[0, [[0, 5, 14]]],  [1, [[0, 5, 14]]],  [2, [[0, 5, 14]]],
  #    ...
  #    [16, [[0, 5, 14]]]]

After all elements of e have been passed to the block, the block returns

h #=> {3=>[[0, 5, 14], [1, 6, 15]],
  #    1=>[[2], [3], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]],
  #    2=>[[4, 16], [5, 14], [6, 15]]} 

Consider substrings that appear just once: h[1]. One of those is [2]. This pertains to the 4-character substring beginning at str offset 2:

str[2,4]
  #=> "BABA"

That is found to be the only instance of that substring. Similarly, among the substrings that appear twice is str[4,4] = str[16,4] #=> "BABB", given by h[2][0] #=> [4, 16].

Next we determine the greatest frequency of a substring of length 4:

c = h.max_by { |k,_| k }
  #=> [3, [[0, 5, 14], [1, 6, 15]]] 
d = c.last
  #=> [[0, 5, 14], [1, 6, 15]]

For convenience, convert d to a hash:

d.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
  #=> {"ABBA"=>[0, 5, 14], "BBAB"=>[1, 6, 15]}

and return that hash from the method.

There is one detail that mention. It is possible that d will contain two or more arrays that reference the same substring, in which case the value of the associated key (the substring) will equal the last of those arrays. Here's a simple example.

str = "AAA"
k = 2

In this case the array c above will equal

c = [[0], [1]]

Both of these reference str[0,2] #=> str[1,2] #=> "AA". In building the hash the first is overwritten by the second:

c.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
  #=> {"AA"=>[1]}