Karthick Karthick - 3 months ago 9
Ruby Question

How to process a large file in a loop as input for another file

I have a large file with more than 1M lines, and another file containing the input strings I need to match against the lines of the large file.

I was able to do it this way:

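File.foreach(strings_file) do |l|
  needle = l.chomp  # drop the trailing newline so include? can match
  # The large file is reopened and rescanned for every input string
  File.foreach(large_file) do |line|
    next unless line.include?(needle)
    puts line
  end
end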


But this opens the large file on every iteration. For example, my input-strings file has 100 lines, so the large file is opened and scanned 100 times, which takes a long time to complete.

Is there a faster method that avoids opening the large file 100 times?

Answer

First of all, you'll have a multiplicative scaling problem if you get this wrong. If input file A has N lines and B has M lines, then you'll need to do N*M tests to check for overlap. That can be impossibly slow.
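To put rough numbers on that, here is a back-of-the-envelope sketch using the sizes mentioned in the question (100 input strings, about 1M lines). The variable names are only illustrative, and the counts ignore the cost of each individual test:

```ruby
# Figures taken from the question; the variable names are illustrative.
strings_count = 100        # lines in the input-strings file
large_count   = 1_000_000  # lines in the large file ("more than 1M")

# Nested loops: every input string is tested against every line.
nested_tests = strings_count * large_count

# Single pass: one quick lookup per line of the large file.
single_pass_lookups = large_count

puts "nested loops: #{nested_tests} tests"
puts "single pass:  #{single_pass_lookups} lookups"
```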

Instead, pull in the input lines and stick them in something you can use for quick lookups:

require 'set'
match_lines = Set.new(File.readlines(strings_file).map(&:chomp))

Then you can test each line of the large file very quickly:

File.foreach(large_file) do |line|
  # Hash-based Set lookup on the line with its line ending removed
  print line if match_lines.include?(line.chomp)
end

I'm using chomp here to avoid a failed match if the last line in your match file doesn't end with a newline, or if one file uses CRLF line endings and the other uses LF.
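
Note that a Set lookup only matches whole lines, whereas the code in the question uses include?, which is a substring test. If substring matching is what you actually need, one possible approach (a sketch, not the only way) is to combine the input strings into a single pattern with Regexp.union and still read the large file only once. The method name each_substring_match and its parameter names are hypothetical, chosen for this example:

```ruby
# Sketch: yield every line of large_path that contains at least one of the
# strings listed (one per line) in strings_path.
# The method and parameter names are hypothetical.
def each_substring_match(strings_path, large_path)
  # Without a block, return an Enumerator so callers can chain or collect.
  return enum_for(__method__, strings_path, large_path) unless block_given?

  # chomp strips LF and CRLF endings; blank entries are dropped because an
  # empty string is a substring of every line.
  needles = File.readlines(strings_path).map(&:chomp).reject(&:empty?)
  return if needles.empty?

  # Regexp.union escapes each string and joins them into one alternation,
  # so each line is tested against a single pattern.
  pattern = Regexp.union(needles)

  # The large file is opened once and streamed line by line.
  File.foreach(large_path) do |line|
    yield line if line.match?(pattern)
  end
end
```

It could then be called as each_substring_match(strings_file, large_file) { |line| print line }. With a very large number of input strings the combined pattern may itself become slow, so it is worth measuring on your own data.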