Jeffrey Yong - 1 month ago 5
Ruby Question

Opening file in Ruby

Code sample 1:


def count_lines1(file_name)
  open(file_name) do |file|
    count = 0
    while file.gets
      count += 1
    end
    count
  end
end


Code sample 2:

def count_lines2(file_name)
  file = open(file_name)
  count = 0
  while file.gets
    count += 1
  end
  count
end


I am wondering which is the better way to implement the counting of lines in a file, in terms of good Ruby style.

Answer

which is the better way to implement the counting of lines in a file.

Neither. Ruby can do it easily using foreach:

def count_lines(file_name)
  lines = 0
  File.foreach(file_name) { lines += 1 }
  lines
end

If I run that against my ~/.bashrc:

$ ruby test.rb
37

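foreach reads the file one line at a time instead of loading it all into memory, so it is very fast and will avoid scalability problems. It also closes the file for you when it reaches the end.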
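As a shorter variant of the same idea, here is a minimal sketch that relies on File.foreach returning an Enumerator when it is called without a block. The name count_lines_enum is made up for illustration; it is not part of the question or of Ruby itself.

```ruby
# Hypothetical variant of count_lines: with no block, File.foreach returns
# an Enumerator, and Enumerable#count walks it one line at a time, so the
# whole file is never held in memory.
def count_lines_enum(file_name)
  File.foreach(file_name).count
end
```

It should give the same number as the explicit counter, since both iterate over the same lines.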

Alternately, you could take advantage of tools in the OS, such as wc -l, which was written specifically for the task:

`wc -l .bashrc`.to_i

which will return 37 again. If the file is huge, wc will likely outrun a pure-Ruby loop because wc is compiled code.
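If the file name comes from a variable rather than being typed into the backticks, one possible way to call wc is through Open3 from the standard library, passing the arguments as a list so that no shell has to parse the name. This is only a sketch: count_lines_wc is a hypothetical helper, and it assumes a wc binary is on the PATH.

```ruby
require 'open3'

# Hypothetical helper: asks the system's wc for the line count.
# Assumes a POSIX-style wc is installed and reachable on the PATH.
def count_lines_wc(file_name)
  # Arguments are passed separately, so file_name is never shell-interpolated.
  output, status = Open3.capture2('wc', '-l', file_name)
  raise "wc failed for #{file_name}" unless status.success?

  # wc prints the count followed by the file name; to_i keeps the leading number.
  output.to_i
end
```

Like the backtick version, this counts newline characters, so a last line with no trailing newline is not included in the total.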


You can also read in large chunks with read and count newline characters.

Yes, read will allow you to do that, but the scalability issue will remain. In my environment read or readlines can be a script killer because we often have to process files well into the tens of GB. There's plenty of RAM to hold the data, but the I/O suffers because of the overhead of slurping the data. "Why is slurping a file bad?" goes into this.

An alternate way of reading in big chunks is to tell Ruby to read a set block size, count the line-ends in that block, and loop until the file is read completely. I didn't test that method in the above linked answer, but in the past I did similar things when I was writing in Perl and found that the difference wasn't enough to justify the extra code. At that point, if all I was doing was counting lines, it'd make more sense to call wc -l and let it do the work, as it'd be a lot faster in coding time and most likely in execution time too.
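For reference, the block-reading approach described above might look roughly like the following. It is a sketch rather than something I benchmarked: count_lines_chunked and CHUNK_SIZE are names invented for the example, and the 1 MiB block size is an assumption, not a tuned value.

```ruby
# Assumed block size for each read; adjust to taste.
CHUNK_SIZE = 1024 * 1024 # 1 MiB

# Hypothetical helper: counts newline characters by reading fixed-size blocks.
def count_lines_chunked(file_name, chunk_size = CHUNK_SIZE)
  count = 0
  File.open(file_name, 'rb') do |file|
    buffer = String.new
    # read(length, outbuf) reuses buffer and returns nil at end of file,
    # which ends the loop.
    while file.read(chunk_size, buffer)
      count += buffer.count("\n")
    end
  end
  count
end
```

Only one block is held in memory at a time, so it scales to large files. Note that it counts newline characters the way wc -l does, so a final line without a trailing newline is not counted, whereas foreach would count it.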
