Karthick - 6 months ago
Ruby Question

Ruby: process a large file in a loop using another file as input

I have a large file with more than 100,000 lines (1 lakh) and another file containing input strings. I need to print only the lines of the large file that match those strings.

I was able to do it in the following way:

File.foreach(strings_file) do |l|
  # Re-reads the whole large file once per input string
  File.foreach(large_file) do |line|
    next unless line.include?(l.chomp)
    puts line
  end
end

But this way the large file is opened on every iteration of the outer loop. For example, if the input-strings file has 100 lines, the large file is opened and scanned 100 times, which takes a long time to complete.

Is there a faster method that avoids opening the same large file 100 times?


First of all, you'll have a scaling problem if you get this wrong. If input file A has N lines and B has M lines, the nested approach needs N*M tests to check for overlap, so the cost grows with the product of the two sizes. That can be impossibly slow.
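To put rough numbers on that, here is a small sketch using the sizes mentioned in the question (100 input strings, about 100,000 lines). The method name `nested_tests` is made up for illustration.

```ruby
# Number of include? checks the nested-loop approach performs:
# one test per (input string, large-file line) pair.
def nested_tests(string_count, line_count)
  string_count * line_count
end

puts nested_tests(100, 100_000) # prints 10000000
```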

Instead, pull in the input lines and stick them in something you can use for quick lookups:

require 'set'
match_lines = Set.new(File.readlines(strings_file).map(&:chomp))

Then you can test very quickly here:

File.foreach(large_file) do |line|
  # Set lookup is roughly constant time, so the large file is read only once
  print line if match_lines.include?(line.chomp)
end

I'm using chomp here so a match doesn't fail when the last line of your match file has no trailing newline, or when one file uses CRLF line endings and the other uses LF.
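Note that the Set lookup matches whole lines exactly, whereas the code in the question used include?, which is a substring test. If substring matching is what's actually needed, one possible sketch is to build a single pattern from the input strings and still read the large file only once. The helper name `matching_lines` is hypothetical, and this assumes Ruby 2.4 or later (for `chomp: true` and `match?`).

```ruby
# Hypothetical helper: returns the lines of large_path that contain any of
# the strings listed, one per line, in strings_path.
def matching_lines(strings_path, large_path)
  # chomp: true strips the trailing LF or CRLF from each input string
  needles = File.readlines(strings_path, chomp: true).reject(&:empty?)
  return [] if needles.empty?

  # Regexp.union escapes each string and joins them with |,
  # so the strings are matched literally, not as regex syntax
  pattern = Regexp.union(needles)

  # Single pass over the large file, one line in memory at a time
  File.foreach(large_path).select { |line| line.match?(pattern) }
end

# Example call (paths are placeholders):
#   matching_lines('strings.txt', 'large.txt').each { |line| print line }
```

This collects the matches into an array. For very large result sets it may be better to print each line inside the loop instead.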