Nick D - 1 month ago
Ruby Question

How can I generate a percentage for a regex string match in Ruby?

I'm trying to build a simple method that looks through about 100 database entries and pulls out every one whose last name matches a given name in more than a specified percentage of its letters. My current approach is:


  1. Pull all 100 entries from the database into an array

  2. Iterate through them while performing the following action

  3. Split the last name into an array of letters

  4. Subtract that array from another array that contains the letters of the name I am trying to match, which leaves only the letters that weren't matched.

  5. Take the size of the result and divide by the original size of the array from step 3 to get a percentage.

  6. If the percentage is above a predefined threshold, push that database object into a results array.



This works, but I feel like there must be some cool Ruby/regex/Active Record method of doing this more efficiently. I have googled quite a bit but can't find anything.
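For reference, the six steps above might look roughly like the following. This is only one reading of the description: the array `last_names` is a hypothetical stand-in for the database rows, the method name `naive_matches` is invented for illustration, and the score is taken to be the share of the target's letters found in each name.

```ruby
# Hypothetical stand-in for the rows pulled from the database (step 1).
last_names = ["Bond", "Bonde", "Smith", "Dobson"]

# Returns the names whose share of matched target letters exceeds threshold.
# Array#- removes every occurrence of a letter, so repeated letters are not
# counted separately here.
def naive_matches(last_names, target, threshold)
  target_letters = target.downcase.chars
  last_names.select do |name|                              # step 2
    name_letters = name.downcase.chars                     # step 3
    unmatched = target_letters - name_letters              # step 4
    matched = 1 - unmatched.size / target_letters.size.to_f # step 5
    matched > threshold                                    # step 6
  end
end

p naive_matches(last_names, "Bond", 0.7)
```

Note that "Dobson" scores as a full match for "Bond" under this measure, because letter order and extra letters are ignored.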

Answer

To comment on the merit of the measure you suggested would require speculation, which is out-of-bounds at SO. I therefore will merely demonstrate how you might implement your proposed approach.

Code

First define a helper method:

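class Array
  # Multiset difference: removes one occurrence from self for each
  # occurrence of the same element in other.
  def difference(other)
    counts = other.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
    reject { |e| counts[e] > 0 && counts[e] -= 1 }
  end
end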

In short, if

a = [3,1,2,3,4,3,2,2,4]
b = [2,3,4,4,3,4]

then

a - b           #=> [1]

whereas

a.difference(b) #=> [1, 3, 2, 2]

This method is elaborated in my answer to this SO question. I've found so many uses for it that I've proposed it be added to the Ruby Core.
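One caveat for newer Rubies: since Ruby 2.6, Array already has a built-in `difference` method that behaves like `-` (set-style removal), so reopening Array as above replaces it. A minimal sketch of the same idea as a plain method, which leaves Array untouched, could look like this; the name `multiset_difference` is hypothetical.

```ruby
# Hypothetical standalone version of the helper: removes one occurrence from
# `a` for each occurrence of the same element in `b`.
def multiset_difference(a, b)
  # Count how many times each element of b may still be removed from a.
  remaining = b.each_with_object(Hash.new(0)) { |e, counts| counts[e] += 1 }
  a.reject do |e|
    next false if remaining[e].zero? # nothing left to cancel: keep e
    remaining[e] -= 1
    true                             # cancel this occurrence of e
  end
end

a = [3, 1, 2, 3, 4, 3, 2, 2, 4]
b = [2, 3, 4, 4, 3, 4]

p a - b                      #=> [1]
p multiset_difference(a, b)  #=> [1, 3, 2, 2]
```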

The following method produces a hash whose keys are the elements of names (strings) and whose values are the fractions of the letters in target that also appear in each name. Case and non-letter characters are ignored, and repeated letters are counted separately.

def target_fractions(names, target)
  target_arr = target.downcase.scan(/[a-z]/)
  target_size = target_arr.size
  names.each_with_object({}) do |s,h|
    s_arr = s.downcase.scan(/[a-z]/)
    target_remaining = target_arr.difference(s_arr)
    h[s] = (target_size-target_remaining.size)/target_size.to_f
  end
end

Example

Suppose the target is

target = "Jimmy S. Bond"

and the names you are comparing are given by

names = ["Jill Dandy", "Boomer Asad", "Josefine Simbad"]

then

target_fractions(names, target)
  #=> {"Jill Dandy"=>0.5, "Boomer Asad"=>0.5, "Josefine Simbad"=>0.8} 
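To tie these fractions back to the threshold in the question, the hash can be filtered on its values. The sketch below is self-contained, so it restates the fraction computation with a plain helper method instead of reopening Array; the names `remove_matched` and `names_at_or_above` are hypothetical, and whether the cutoff should be inclusive is an assumption.

```ruby
# Hypothetical plain-method form of the multiset difference helper.
def remove_matched(a, b)
  remaining = b.each_with_object(Hash.new(0)) { |e, counts| counts[e] += 1 }
  a.reject { |e| remaining[e] > 0 && remaining[e] -= 1 }
end

# Same calculation as target_fractions above, using the plain helper.
def target_fractions(names, target)
  target_arr = target.downcase.scan(/[a-z]/)
  target_size = target_arr.size
  names.each_with_object({}) do |s, h|
    s_arr = s.downcase.scan(/[a-z]/)
    target_remaining = remove_matched(target_arr, s_arr)
    h[s] = (target_size - target_remaining.size) / target_size.to_f
  end
end

# Hypothetical helper: names whose fraction is at least threshold.
def names_at_or_above(names, target, threshold)
  target_fractions(names, target)
    .select { |_name, fraction| fraction >= threshold }
    .keys
end

names = ["Jill Dandy", "Boomer Asad", "Josefine Simbad"]
p names_at_or_above(names, "Jimmy S. Bond", 0.6)  #=> ["Josefine Simbad"]
```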

Explanation

For the above values of names and target,

target_arr = target.downcase.scan(/[a-z]/)
  #=> ["j", "i", "m", "m", "y", "s", "b", "o", "n", "d"] 
target_size = target_arr.size
  #=> 10

Now consider

s = "Jill Dandy"
h = {}

then

s_arr = s.downcase.scan(/[a-z]/)
  #=> ["j", "i", "l", "l", "d", "a", "n", "d", "y"]
target_remaining = target_arr.difference(s_arr)
  #=> ["m", "m", "s", "b", "o"]

h[s] = (target_size-target_remaining.size)/target_size.to_f
  #=> (10-5)/10.0 => 0.5
h #=> {"Jill Dandy"=>0.5}

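The calculations are similar for "Boomer Asad" and "Josefine Simbad". For "Boomer Asad", target_remaining is ["j", "i", "m", "y", "n"], giving (10-5)/10.0 => 0.5; for "Josefine Simbad" it is ["m", "y"], giving (10-2)/10.0 => 0.8.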