Draco Draco - 3 months ago 23
Ruby Question

UTF-8 conversion works with Iconv but not with String#encode

I had this with Iconv:

git_log = Iconv.conv 'UTF-8', 'iso8859-1', git_log


Now I want to switch to String#encode because of the deprecation warnings, but this doesn't work:

git_log = git_log.encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '')


I used to use Iconv here, and it's still working:

https://github.com/gamersmafia/gamersmafia/blob/master/lib/formatting.rb#L244

But when I replace this line with the String#encode call, the first gsub raises an "invalid byte sequence in UTF-8" error.

Do you know why?

Answer

In your call to String#encode you don't specify a source encoding. Ruby uses the string's current encoding as the source, which appears to be UTF-8, and according to the docs:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

In other words, the call has no effect and leaves the bytes in the string as they are, still encoded as ISO-8859-1. The next call to gsub then tries to interpret these bytes as UTF-8, and since they are invalid (they are unchanged from ISO-8859-1) you get the error you see.
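The failure can be reproduced in isolation. This is only a minimal sketch, assuming a hypothetical sample string (the ISO-8859-1 bytes for "café") that Ruby has labelled as UTF-8, which is what appears to be happening with git_log:

```ruby
# Hypothetical sample: ISO-8859-1 bytes for "café" (0xE9 is "é"),
# deliberately mislabelled as UTF-8.
sample = "caf\xE9".b.force_encoding(Encoding::UTF_8)

# A lone 0xE9 byte is not a complete UTF-8 sequence.
sample_valid = sample.valid_encoding?

# Regexp matching has to decode the string, so gsub raises an
# ArgumentError like the one reported in the question.
gsub_error = begin
  sample.gsub(/a/, 'b')
  nil
rescue ArgumentError => e
  e
end
```

Checking valid_encoding? before the first gsub is a cheap way to confirm whether a string is mislabelled.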

String#encode has a form that accepts the source encoding as the second parameter, so you can specify it explicitly, much as you do with Iconv. Try this:

git_log = git_log.encode(Encoding::UTF_8,
                         Encoding::ISO_8859_1,
                         :invalid => :replace,
                         :undef => :replace,
                         :replace => '')
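To see what the explicit source encoding changes, here is a small sketch on a hypothetical sample string rather than the real git_log. It assumes the bytes really are ISO-8859-1 and that the string is merely tagged as UTF-8:

```ruby
# Hypothetical sample: ISO-8859-1 bytes for "café", mislabelled as UTF-8.
raw = "caf\xE9".b.force_encoding(Encoding::UTF_8)

# Destination encoding first, then the encoding the existing bytes
# should be interpreted as.
converted = raw.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

converted_valid = converted.valid_encoding?
# The single byte 0xE9 becomes the two-byte UTF-8 sequence for "é".
last_bytes = converted.bytes.last(2)
# gsub no longer raises, because the bytes are now valid UTF-8.
replaced = converted.gsub(/caf/, 'CAF')
```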

You could also use the ! form in this case, which has the same effect but modifies the string in place instead of returning a new one:

git_log.encode!(Encoding::UTF_8,
                Encoding::ISO_8859_1,
                :invalid => :replace,
                :undef => :replace,
                :replace => '')
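As a quick illustration of the in-place behaviour, the sketch below runs encode! on a hypothetical sample string (again assumed to hold ISO-8859-1 bytes tagged as UTF-8) and checks that the receiver itself was converted:

```ruby
# Hypothetical sample: ISO-8859-1 bytes for "café", mislabelled as UTF-8.
log = "caf\xE9".b.force_encoding(Encoding::UTF_8)

# encode! transcodes the receiver and returns it, so no reassignment
# is needed.
returned = log.encode!(Encoding::UTF_8,
                       Encoding::ISO_8859_1,
                       :invalid => :replace,
                       :undef => :replace,
                       :replace => '')

same_object = returned.equal?(log)
log_valid = log.valid_encoding?
```

Because the original string object is changed, any other variable holding a reference to it sees the converted text as well.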