spektom - 10 months ago
Ruby Question

Ruby fixing multiple encoding documents

I'm trying to retrieve a Web page and apply a simple regular expression to it.
Some Web pages contain non-UTF-8 characters, even though the Content-Type header claims UTF-8 (example). In these cases I get:

ArgumentError (invalid byte sequence in UTF-8)

I've tried the following methods for sanitizing bad characters, but neither of them solved the issue:

  1. content = Iconv.conv("UTF-8//IGNORE", "UTF-8", content)

  2. content.encode!("UTF-8", :illegal => :replace, :undef => :replace, :replace => "?")
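For what it's worth, attempt 2 typically fails for a subtle reason: in Ruby 1.9+, converting a string to the encoding it is already tagged with is a no-op, so the invalid bytes survive. A minimal sketch of the well-known UTF-16 round-trip workaround (the string literal here is just an illustration of an invalid byte sequence):

```ruby
# A string tagged as UTF-8 that contains an invalid byte sequence.
content = "abc\xE2\x28\xA1def".force_encoding("UTF-8")

# encode("UTF-8", ...) would be a no-op here, since the string is already
# tagged UTF-8. Round-tripping through UTF-16 forces a real transcode,
# replacing the invalid bytes along the way.
clean = content.encode("UTF-16", invalid: :replace, undef: :replace, replace: "?")
               .encode("UTF-8")

# clean is now valid UTF-8 and safe to match a regexp against.
```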

Here's the complete code:

response = Net::HTTP.get_response(url)
@encoding = detect_encoding(response) # Detects encoding using Content-Type or meta charset HTML tag
if @encoding
  @content = response.body.force_encoding(@encoding)
  @content = Iconv.conv(@encoding + '//IGNORE', @encoding, @content)
else
  @content = response.body
end

@content.gsub!(/.../, "") # in-place substitution raises on invalid bytes

Is there a way to deal with this issue? Basically, what I need is to set the base URL meta tag and inject some JavaScript into the retrieved Web page.
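Once the body is valid UTF-8, the stated goal is simple string surgery. A minimal sketch, assuming a plain substitution after the opening head tag is enough (the script path and helper name are placeholders, not from the question):

```ruby
# Inject a <base> tag and a script reference right after the opening <head>.
# The snippet contents are illustrative placeholders.
def inject_into_head(html, base_url)
  snippet = %(<base href="#{base_url}"><script src="/injected.js"></script>)
  html.sub(/<head[^>]*>/i) { |match| match + snippet }
end
```

For anything beyond a trivial injection, an HTML parser such as Nokogiri would be more robust than regex substitution.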



I had a similar problem importing emails with different encodings, and I ended up with this:

def enforce_utf8(from = nil)
  self.is_utf8? ? self : Iconv.iconv('utf8', from, self).first
rescue
  converter = Iconv.new('UTF-8//IGNORE//TRANSLIT', 'ASCII//IGNORE//TRANSLIT')
  converter.iconv(self).unpack('U*').select { |cp| cp < 127 }.pack('U*')
end

At first, it tries to convert from *some_format* to UTF-8; in case there isn't any encoding, or Iconv fails for some reason, it falls back to a strong conversion (ignore errors, transliterate characters, and strip unrecognized characters).
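Note that Iconv was deprecated in Ruby 1.9 and dropped from the standard library in 2.0. Roughly the same "convert, then strip what can't be mapped" idea can be sketched with `String#encode` and `String#scrub` (Ruby 2.1+); the method name mirrors the one above, but this is an alternative, not the original:

```ruby
# Iconv-free variant: retag with the source encoding when one is known,
# transcode to UTF-8 replacing unmappable characters, then scrub any
# invalid bytes that survive (encode is a no-op for UTF-8-tagged input).
def enforce_utf8(str, from = nil)
  str = str.dup.force_encoding(from) if from
  str.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
     .scrub("?")
end
```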

Let me know if it works for you ;)