Draco - 1 year ago
Ruby Question

UTF-8 conversion not working with String#encode but Iconv

I had this with Iconv:

git_log = Iconv.conv 'UTF-8', 'iso8859-1', git_log

Now I want to switch to String#encode because of deprecation warnings, but this doesn't work:

git_log = git_log.encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '')

The Iconv version still works. But when I replace that line with the String#encode call, the first gsub afterwards raises an "invalid byte sequence in UTF-8" error.

Do you know why?

Answer Source

In your call to String#encode you don't specify a source encoding. Ruby uses the string's current encoding as the source, which appears to be UTF-8, and according to the docs:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

In other words the call has no effect, and leaves the bytes in the string as they are, encoded as ISO-8859-1. The next call to gsub then tries to interpret these bytes as UTF-8, and since they are invalid (they are unchanged from ISO-8859-1) you get the error you see.
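To make this concrete, here is a minimal sketch that reproduces the error. The sample string is an assumption standing in for git_log: the bytes of "café" in ISO-8859-1, tagged as UTF-8. The sketch calls encode without the replace options because, as far as I know, newer Ruby versions scrub invalid bytes when :invalid => :replace is passed, even for a same-encoding call.

```ruby
# Hypothetical stand-in for git_log: "café" as ISO-8859-1 bytes
# ("é" is the single byte 0xE9), but tagged as UTF-8.
mislabeled = "caf\xE9".b.force_encoding(Encoding::UTF_8)

# Source and target encoding are both UTF-8, so the bytes come back untouched.
same = mislabeled.encode(Encoding::UTF_8)
puts same.bytes == mislabeled.bytes   # true
puts same.valid_encoding?             # false

# A regexp operation on the still-invalid string raises ArgumentError.
error_message = begin
  same.gsub(/f/, 'F')
  nil
rescue ArgumentError => e
  e.message
end
puts error_message                    # invalid byte sequence in UTF-8
```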

String#encode has a form that accepts the source encoding as the second parameter, so you can specify it explicitly, much as you do with Iconv. Try this:

git_log = git_log.encode(Encoding::UTF_8, Encoding::ISO_8859_1,
                         :invalid => :replace,
                         :undef => :replace,
                         :replace => '')
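As a quick check of the two-encoding form, the sketch below runs it on a small sample. The sample string is an assumption (ISO-8859-1 bytes for "café", tagged as UTF-8), not the real git_log output.

```ruby
# Hypothetical sample: ISO-8859-1 bytes for "café", mislabeled as UTF-8.
raw = "caf\xE9".b.force_encoding(Encoding::UTF_8)

# Passing the source encoding makes Ruby read the bytes as ISO-8859-1
# and transcode them, regardless of the string's current tag.
converted = raw.encode(Encoding::UTF_8, Encoding::ISO_8859_1,
                       :invalid => :replace,
                       :undef => :replace,
                       :replace => '')

puts converted.valid_encoding?        # true
puts converted == "caf\u00E9"         # true
puts converted.gsub(/\u00E9/, 'e')    # cafe
```

After the conversion, the 0xE9 byte has become the two-byte UTF-8 sequence for "é", so gsub no longer raises.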

You could also use the ! form in this case, which has the same effect:

git_log.encode!(Encoding::UTF_8, Encoding::ISO_8859_1,
                :invalid => :replace,
                :undef => :replace,
                :replace => '')
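One practical difference with the ! form is that it modifies the receiver instead of returning a new string, so every reference to that object sees the converted text. A small sketch, again using an assumed sample string rather than real git output:

```ruby
# Hypothetical sample: ISO-8859-1 bytes for "café", mislabeled as UTF-8.
log = "caf\xE9".b.force_encoding(Encoding::UTF_8)
other_ref = log   # second reference to the same String object

returned = log.encode!(Encoding::UTF_8, Encoding::ISO_8859_1,
                       :invalid => :replace,
                       :undef => :replace,
                       :replace => '')

puts returned.equal?(log)        # true
puts other_ref.valid_encoding?   # true
puts other_ref == "caf\u00E9"    # true
```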