Dave - 2 months ago
Ruby Question

In Ruby, how do I deal with non-UTF-8 characters in PDF content?

I’m using Rails 4.2.7. I’m downloading and writing PDF content from the web, like so …

res1 = Net::HTTP.SOCKSProxy('127.0.0.1', 50001).start(uri.host, uri.port) do |http|
  puts "launching #{uri}"
  resp = http.get(uri)
  status = resp.code
  content = resp.body
  content_type = resp['content-type']
  content_encoding = resp['content-encoding']
end

if content_type == 'application/pdf' || content_type.include?('application/x-javascript')
  File.open(file_location, "w") { |file| file.write content }
end


I’m noticing that for some content, I get the following error:

Error during processing: "\xC2" from ASCII-8BIT to UTF-8
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `write'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `block in pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `open'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_import_service.rb:76:in `process_race_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_race_finder_service.rb:75:in `process_race_link'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:29:in `block in process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `each'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `process_data'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:18:in `block in run_all_crawlers'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord-4.2.7.1/lib/active_record/relation/delegation.rb:46:in `each'


I tried accounting for it by replacing invalid characters, like so …

File.open(file_location, "w") { |file| file.write content }
content.encode('UTF-8', :invalid => :replace, :undef => :replace)


but then I get the error

error: PDF malformed, expected 'endstream' but found 0 instead


when trying to read the PDF file. Does anyone know of a better way to deal with downloaded PDF docs that won’t corrupt them?

Answer

I think the easiest solution would be to write the content as-is, in binary mode, using IO.binwrite:

File.binwrite(file_location, content)
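
To see why binary mode sidesteps the error, here is a minimal sketch. It assumes a made-up byte string standing in for resp.body and a temp-file path standing in for file_location; neither comes from the question. The explicit encode call is only a stand-in for the ASCII-8BIT to UTF-8 conversion that a text-mode write attempts when Encoding.default_internal is UTF-8, which is what Rails typically configures.

```ruby
require "tmpdir"

# Hypothetical stand-in for resp.body: a binary (ASCII-8BIT) string that
# contains bytes >= 0x80, as compressed PDF streams usually do.
content = "%PDF-1.4\n\xC2\xA0\xFF binary stream\n%%EOF\n".b

# Hypothetical path for this demo only; the question uses file_location.
file_location = File.join(Dir.tmpdir, "binwrite_demo.pdf")

# Converting binary data to UTF-8 fails, because a byte such as 0xC2 has
# no defined character in ASCII-8BIT. This mirrors the reported error.
conversion_error =
  begin
    content.encode(Encoding::UTF_8)
    nil
  rescue Encoding::UndefinedConversionError => e
    e
  end
puts conversion_error.class

# A binary write performs no conversion, so the bytes land on disk unchanged.
File.binwrite(file_location, content)
round_trip = File.binread(file_location)
puts round_trip == content

File.delete(file_location)
```

File.open(file_location, "wb") { |file| file.write content } should behave the same way, since the "b" flag also opens the file in binary mode.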

The above might fail if the files you receive come in different encodings. In that case I would try:

content.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
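
As a rough illustration of what that line does, the sketch below uses a short made-up byte string (not data from the question). force_encoding only relabels the bytes (and modifies the receiver in place, hence the dup here), while encode rewrites them: every byte at or above 0x80 becomes a two-byte UTF-8 sequence. Since every byte is a valid ISO-8859-1 character, the conversion never raises, but the output is no longer byte-for-byte identical to the input. That makes it a reasonable fit for text you want to display or store as a string, and a poor fit for the raw PDF bytes you write to disk.

```ruby
# Hypothetical sample: "caf", 0xE9 ("é" in ISO-8859-1), a space, then 0xC2.
raw = "caf\xE9 \xC2".b

# Relabel the bytes as ISO-8859-1, then transcode them to UTF-8.
utf8 = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts utf8.valid_encoding? # the result is well-formed UTF-8
puts raw.bytesize         # 6 bytes before
puts utf8.bytesize        # 8 bytes after: the two high bytes each grew to two
```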