marcamillion - 1 year ago
Ruby Question

incompatible character encodings: ASCII-8BIT and UTF-8 in Oga gem

I am using an XML/HTML parser called Oga.

I am attempting to crawl a URL and parse the body for text, like so:

def get_page
  body = Net::HTTP.get(URI.parse(@url))
  document = Oga.parse_html(body)
end

document = get_page
words = document.css('body').text

I get this error:

/gems/oga-2.7/lib/oga/xml/node_set.rb:276:in `block in text': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)

That is related to this bit of code here.

What could be causing this and how can I fix it? Is there a way for me to fix it locally, or do I have to fork the gem, fix that method and then use my fork?


Answer Source

The code you linked is not the cause of the problem. The issue is that body is being interpreted in the wrong encoding: as the error message shows, it is tagged ASCII-8BIT (binary) rather than UTF-8. Try adding body = body.force_encoding 'UTF-8' before parsing the document:

def get_page
  body = Net::HTTP.get(URI.parse(@url)).force_encoding 'UTF-8'
  document = Oga.parse_html(body)
end
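
To see why this helps, here is a minimal sketch that reproduces the same error using only core Ruby, with no network access and no Oga. The strings and variable names (body, page_text, fixed, combined) are made up for illustration, and it assumes the page really is UTF-8. If the server sends a different charset, force_encoding 'UTF-8' would only mislabel the bytes, which is why the sketch also checks valid_encoding?.

```ruby
# Stand-in for a response body: UTF-8 bytes tagged as binary (ASCII-8BIT).
body = "caf\u00E9".b
page_text = "na\u00EFve " # an ordinary UTF-8 string

# Joining the two raises, because both contain non-ASCII bytes
# and their encodings differ.
error = begin
  page_text + body
  nil
rescue Encoding::CompatibilityError => e
  e
end

# force_encoding only relabels the bytes; it does not convert them.
fixed = body.dup.force_encoding('UTF-8')
# If the bytes are not actually valid UTF-8, replace the invalid sequences.
fixed = fixed.scrub('?') unless fixed.valid_encoding?

combined = page_text + fixed
puts combined
```

The same relabelling is what the one-line change in get_page does to the downloaded body before it reaches Oga.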
