Ger Cas - 3 years ago
Ruby Question

Matching pattern list between 2 files in Ruby

I'm trying to print the lines from Input.txt that contain any of the strings in ValuesToSearch.txt. My current script, shown below, prints the correct output,
but when I run it on the actual data, where Input.txt has 9.5 million lines and ValuesToSearch.txt has 300 lines, processing is very slow.

How can the script be modified to produce the output faster? Thanks

Input.txt

ID HM PRAO LN AC
1401144 851 2 45 32
1401145 6D2 4 45 32
1401146 B33 1 45 32
1401147 EEC 9 45 32
1401148 730 1 45 32
1401149 C08 3 45 32
1401150 B91 4 45 32
1401151 978 1 45 32
1401152 6A9 0 45 32


ValuesToSearch.txt

1401176
1401148
1401149
1401151


My script:

ruby -e '
a=File.foreach("Input.txt").map {|l| l.split(" ")}
b=File.foreach("ValuesToSearch.txt").map {|l| l.split(" ")}.flatten

b.map{ |z|
a.map{ |i| puts i.join(" ") if i.include?(z) }
}'

Output:

1401148 730 1 45 32
1401149 C08 3 45 32
1401151 978 1 45 32

Answer

What about this?

dict = File.read('/tmp/ValuesToSearch.txt').split.inject({}) do |acc, word|
  acc[word] = true
  acc
end

File.foreach('/tmp/Input.txt') do |line|
  puts line if line.split.any? { |word| dict[word] }
end

This approach stores the values to search for in a Hash.
Each word can then be looked up in O(1) time on average, instead of being compared against all N search values in turn (O(N)).
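
As a rough illustration of that difference, here is a minimal sketch that times a membership check against an Array and against a Hash. The 300 generated values, the probe value, and the variable names are made up for the example (they only mimic the size of ValuesToSearch.txt), and the actual timings will depend on the machine and the Ruby version.

```ruby
require 'benchmark'

# Hypothetical data: 300 search values, similar in size to ValuesToSearch.txt.
values = (1_401_000...1_401_300).map(&:to_s)

as_array = values
as_hash  = values.each_with_object({}) { |value, hash| hash[value] = true }

# The last element is the worst case for the linear Array scan.
probe = '1401299'

# Repeat each lookup many times so the difference is measurable.
n = 20_000
Benchmark.bm(16) do |x|
  x.report('Array#include?') { n.times { as_array.include?(probe) } }
  x.report('Hash#key?')      { n.times { as_hash.key?(probe) } }
end
```

The Array check walks the list until it finds a match, while the Hash check jumps to the entry directly, so the gap widens as the list of search values grows.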

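It also avoids going over the lines of Input.txt more than once. The original script loads the whole file into memory and scans it once per search value,
whereas this version prints the matching lines in a single pass over the file.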
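
The same idea can also be sketched with a Set, which expresses "is this word one of the search values?" a little more directly than a Hash of true values. This is only one possible variant, not the answer's own code: the method name each_matching_line and its two path parameters are hypothetical, chosen for the example.

```ruby
require 'set'

# Hypothetical helper: yields every line of input_path that contains, as a
# whitespace-separated field, at least one value listed in values_path.
def each_matching_line(input_path, values_path)
  # Without a block, return an Enumerator so callers can chain or collect.
  return enum_for(:each_matching_line, input_path, values_path) unless block_given?

  # Load the (small) list of search values once into a Set.
  wanted = Set.new(File.read(values_path).split)

  # Stream the (large) input file line by line; it is never held in memory.
  File.foreach(input_path) do |line|
    yield line if line.split.any? { |field| wanted.include?(field) }
  end
end
```

It could be called as each_matching_line('Input.txt', 'ValuesToSearch.txt') { |line| puts line }. Because the input is streamed, memory use stays roughly constant regardless of how many lines Input.txt has.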

And as suggested by @tadman, put this script in a file and execute it using ruby myscript.rb.
