mm19 - 17 days ago
Ruby Question

Creating an HTML Parser in Ruby

I need help figuring out a programming problem that I've been working on.

The problem description:

Write a function in Ruby that accepts an HTML document (a string) and a keyword (also a string). The function will find all occurrences of the keyword in the HTML string after the <body> element, unless the keyword appears within an HTML tag, then surround the string found with tags to "highlight" the keyword. For example,

<span style="background-color: blue; color: white">keyword</span>


You will have to be careful not to highlight strings occurring within an HTML tag. For example, if the keyword is "table", you wouldn't want to mark up this:

<table width="100%" border="0">


What I have done so far:

puts "Welcome to the HTML keyword highlighter!"
puts "Please Enter A Keyword: "
keyword = gets.chomp
canEdit = false

infile = File.new("desktop/code.html", "r")
outfile = File.new("Result.html", "w")

infile.each{ |i|
if (i.include? "<body>")
canEdit = true

end

if (i.include? "</body>")
canEdit = false
end

if(canEdit == true)
keyword.gsub(keyword, "<span style=\"background-color: yellow; color: black\">#{keyword}</span>")

outfile.write i
end

outfile.close()
infile.close()
}


The error I receive currently:

Welcome to the HTML keyword highlighter!

Please Enter A Keyword:

simple

/Users/Eva/Desktop/Personal/part4_program.rb:16:in `each': closed stream (IOError)

from /Users/Eva/Desktop/Personal/part4_program.rb:16:in `<main>'


I'm unsure what is causing the error and could use some guidance to fix the issue. I am also wondering if this program is heading in the right direction as an answer to the programming problem. I know Nokogiri is already available as a resource but I had hoped not to have to use it unless it's thought to be a better option.

Answer

I'm unsure what is causing the error and could use some guidance to fix the issue.

Let's first apply some proper formatting to your code, to see more clearly what is going on:

puts 'Welcome to the HTML keyword highlighter!'
puts 'Please Enter A Keyword: '
keyword = gets.chomp
can_edit = false 

infile = File.new('desktop/code.html', 'r')
outfile = File.new('Result.html', 'w')

infile.each {|i| 
  if i.include?('<body>')
    can_edit = true
  end

  if i.include?('</body>')
    can_edit = false
  end

  if can_edit
    keyword.gsub(keyword, %Q[<span style="background-color: yellow; color: black">#{keyword}</span>])
    outfile.write i
  end

  outfile.close
  infile.close
}

The error message says:

    part4_program.rb:16:in `each': closed stream (IOError)

So, what is happening is that you try to iterate using each over a closed file. And why is that? Well, now that the code is indented properly, we can easily see that you close both infile and outfile inside of the each iterator. This will lead to all sorts of problems:

  • You close the file while each is still iterating over it. This will "pull the rug out from under each's feet", so to speak. How can it iterate over the file when the file is closed? You should consider yourself lucky that each detects this and that you got a nice error message and a clean exit: closing the file out from under the iterator that is currently reading it could have led to much subtler and harder-to-diagnose problems.
  • Even if each didn't break because you closed the file out from under it, you still call close every time you go through the iteration, although the file is already closed after the first one. (Ruby versions before 2.3 raise an IOError on the second close; newer ones silently ignore it, but it is pointless either way.)
  • And regardless of how often you call close, you write to outfile after you already closed it during the previous iteration. You can't write to a closed file.
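
One way to avoid all three problems is to not call close yourself at all and use the block form of File.open, which closes the file when the block returns. The following is only a minimal sketch of that idea: copy_body_lines is a hypothetical helper name, the paths are passed in as parameters, and the line-based <body> logic from the question is kept as-is (it only copies lines, it does not highlight anything yet).

```ruby
# Hypothetical helper: copies the lines from the one containing <body> up to
# (but not including) the one containing </body> from in_path to out_path.
# It keeps the question's line-based logic and only fixes the file handling.
def copy_body_lines(in_path, out_path)
  can_edit = false

  # The block form of File.open closes each file when its block returns,
  # even if an exception is raised, so no explicit close is needed.
  File.open(in_path, 'r') do |infile|
    File.open(out_path, 'w') do |outfile|
      infile.each_line do |line|
        can_edit = true  if line.include?('<body>')
        can_edit = false if line.include?('</body>')

        outfile.write(line) if can_edit
      end
    end
  end
end
```

With the file names from the question, this would be called as copy_body_lines('desktop/code.html', 'Result.html'). The sketch leaves out the gsub call on purpose: in the code above it is called on keyword rather than on the current line, and its return value is discarded, so it has no effect on what gets written.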

I am also wondering if this program is heading in the right direction as an answer to the programming problem.

Honestly, I don't even remotely understand what you are trying to do. But I am going to say "No", you are not heading in the right direction.

Here are just a couple of simple ways to break your code:

  • what if the keyword is table?
  • what if <body> and </body> are on the same line?
  • what if the keyword appears on the same line as <body> but before it?
  • what if someone spells it <BODY> or <bOdY> instead?
  • what about optional tags?
  • what about Null End Tags?
  • what if the keyword appears inside a comment?
  • what if the keyword appears inside a tag?
  • what if the keyword appears inside an attribute?
  • what if the keyword appears inside a <script> element?
  • what if the keyword appears inside a <style> element?
  • what if the keyword appears inside a <![CDATA[ section?
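
To make the first and fourth items concrete, here is a small sketch of what a purely textual replacement does to the table example from the problem description, and of how a case-sensitive include? check treats <BODY>. naive_highlight is a hypothetical stand-in for the line-based approach, not code from the question.

```ruby
# Hypothetical stand-in for a purely textual approach: it replaces every
# occurrence of the keyword, with no idea whether it is inside a tag.
def naive_highlight(line, keyword)
  line.gsub(keyword) do |match|
    %Q[<span style="background-color: yellow; color: black">#{match}</span>]
  end
end

# The tag name itself gets wrapped, which destroys the <table> tag:
puts naive_highlight('<table width="100%" border="0">', 'table')
# prints: <<span style="background-color: yellow; color: black">table</span> width="100%" border="0">

# String#include? is case-sensitive, so an upper-case tag is never detected:
puts '<BODY>'.include?('<body>')
# prints: false
```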

I know Nokogiri is already available as a resource but I had hoped not to have to use it unless it's thought to be a better option.

HTML is complex. Really complex. Really, really complex. Unless you have some very good reasons to re-invent the wheel, you should re-use the work someone else has already done. Without even thinking too hard, I could come up with more than half a dozen ways to break your parser, and I didn't even get into the nasty corner cases. (Simply because I don't know the nasty corner cases, because I don't need to know them, because somebody else has already figured them all out.)

The two fundamentals of Programming are Abstraction and Reuse: creating reusable Abstractions and reusing other programmers' Abstractions.