Dave, 4 years ago
Ruby Question

How to avoid "Invalid byte sequence" when looking for link with text using Nokogiri

I'm using Rails 5 with Ruby 2.4 and scanning a document that I parsed with Nokogiri, looking in a case-insensitive way for a link with text:

a_elt = doc ? doc.xpath('//a').detect { |node| /link[[:space:]]+text/i === node.text } : nil


After getting the HTML of my web page in content, I parse it into a Nokogiri doc using:

doc = Nokogiri::HTML(content)


The problem is, I'm getting

ArgumentError invalid byte sequence in UTF-8


on certain web pages when using the above regular expression.

2.4.0 :002 > doc.encoding
=> "UTF-8"
2.4.0 :003 > doc.xpath('//a').detect { |node| /individual[[:space:]]+results/i === node.text }
ArgumentError: invalid byte sequence in UTF-8
from (irb):3:in `==='
from (irb):3:in `block in irb_binding'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `upto'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `each'
from (irb):3:in `detect'
from (irb):3
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
from bin/rails:4:in `require'
from bin/rails:4:in `<main>'


Is there a way I can rewrite the above to automatically account for the encoding or weird characters and not flip out?

Answer Source

Your question may have already been answered before. Have you tried the methods from "Is there any way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?"?

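Specifically, before parsing (and so before the detect block), try removing the invalid bytes and the control characters other than newlines from the HTML string. Both scrub! and gsub! are String methods, so call them on content rather than on the Nokogiri doc:

content.scrub!("")
content.gsub!(/[[:cntrl:]&&[^\n\r]]/, "")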

Remember, scrub! is a Ruby 2.1+ method.
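
As a rough sketch of the same idea applied per node, the non-mutating String#scrub can be run on each node's text inside the detect block, so the regular expression never sees invalid bytes. The names clean_text and FakeNode are hypothetical and only for illustration; FakeNode stands in for Nokogiri nodes so the snippet runs without the gem.

```ruby
# Hypothetical helper, not part of Nokogiri or Rails: returns a cleaned copy
# of a string so that a UTF-8 regular expression can be matched against it.
def clean_text(text)
  scrubbed = text.scrub("")                    # drop invalid byte sequences (Ruby 2.1+)
  scrubbed.gsub(/[[:cntrl:]&&[^\n\r]]/, "")    # drop control characters except \n and \r
end

# Stand-in for Nokogiri nodes (assumption): anything that responds to #text.
FakeNode = Struct.new(:text)

LINK_PATTERN = /individual[[:space:]]+results/i

nodes = [
  FakeNode.new("Home"),
  FakeNode.new("Individual\xE9 Results"), # a lone \xE9 byte is not valid UTF-8
  FakeNode.new("Contact")
]

# Clean each node's text before matching instead of matching the raw text.
found = nodes.detect { |node| LINK_PATTERN === clean_text(node.text) }

puts clean_text(found.text)
```

With a real document, the equivalent call would presumably be doc.xpath('//a').detect { |node| LINK_PATTERN === clean_text(node.text) }. Checking node.text.valid_encoding? first is one way to find out which links carry the bad bytes.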
