Geradlus_RU - 6 months ago
Linux Question

Haskell: quoteFile fails on text file with "invalid byte sequence" on unicode characters

I'm facing an issue with quoteFile in my virtual environment (Debian Wheezy with GHC 7.8.4 installed). I have defined a file-oriented version of the st quasi-quoter from Text.Shakespeare.Text:

import Language.Haskell.TH.Quote (QuasiQuoter, quoteFile)
import Text.Shakespeare.Text (st)

stFile :: QuasiQuoter
stFile = quoteFile st


This works very well on my host machine; however, it fails with the following error in my virtual environment (a Docker image):


Exception when trying to run compile-time code:
test-file.md: hGetContents: invalid argument (invalid byte sequence)

Code: Language.Haskell.TH.Quote.quoteExp
stFile "test-file.md"


A little REPL investigation shows that the error occurs on the first Unicode character in the text file; in my case this is '«' (left-pointing double angle quotation mark):

import System.IO (IOMode(..), hGetContents, hSetEncoding, openBinaryFile, openFile, utf8)

main :: IO ()
main = do
  -- Binary read works fine out of the box.
  h <- openBinaryFile "test-file.md" ReadMode
  bytes <- hGetContents h
  print (length bytes)  -- force the lazy read

  -- Text read works only if the encoding is set explicitly; otherwise
  -- it gives an "invalid byte sequence" error at run time.
  h' <- openFile "test-file.md" ReadMode
  hSetEncoding h' utf8
  text <- hGetContents h'
  print (length text)   -- force the lazy read


It seems to me that I need either to adjust the configuration of my virtual environment or possibly to rebuild GHC itself.

I tried setting the locale to en.UTF-8 UTF-8, but this did not help (initially I had done no locale configuration at all).

Update: the target file has UTF-8 encoding:

$ file -bi test-file.md
text/x-c++; charset=utf-8

Answer

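Finally, I found that the locale in my virtual environment was not set properly: the locale command showed that all the locale variables were set to POSIX. GHC takes the default text encoding for file handles from the locale, so under POSIX it fails on the first non-ASCII byte.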
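To see which encoding GHC actually picks up inside the container, a small diagnostic program can help. This is only a sketch: formatSetting is a helper name made up for this example, and the exact encoding name printed depends on the system.

```haskell
import Data.Maybe (fromMaybe)
import GHC.IO.Encoding (getLocaleEncoding)
import System.Environment (lookupEnv)
import System.IO (hGetEncoding, stdin)

-- Hypothetical helper: render one environment variable for display.
formatSetting :: String -> Maybe String -> String
formatSetting name value = name ++ " = " ++ fromMaybe "(unset)" value

main :: IO ()
main = do
  -- Environment variables that influence the locale.
  lookupEnv "LANG"   >>= putStrLn . formatSetting "LANG"
  lookupEnv "LC_ALL" >>= putStrLn . formatSetting "LC_ALL"
  -- The encoding GHC uses by default for text-mode handles.
  enc <- getLocaleEncoding
  putStrLn ("locale encoding = " ++ show enc)
  -- The encoding currently attached to stdin (Nothing means binary mode).
  stdinEnc <- hGetEncoding stdin
  putStrLn ("stdin encoding  = " ++ show stdinEnc)
```

If the reported locale encoding is not UTF-8, reads through readFile or hGetContents (and therefore quoteFile) can be expected to fail on non-ASCII input.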

Setting the LANG variable for the command is the quickest workaround (bash example):

LANG=en_US.UTF-8 cabal build

However, you will likely need to have the en_US locale installed first. Manual configuration on Debian is:

  1. Edit the file /etc/locale.gen and append the new line en_US.UTF-8 UTF-8.
  2. Invoke locale-gen to generate the locales.
  3. Export the LANG variable.

Debian locales wiki

P.S. My default Debian Wheezy installation had C.UTF-8 in the default locales list, so I believe that, for the sake of minimalism, it is possible to use it rather than install an additional English locale, but I haven't tested this myself.
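If changing the locale is not an option, a locale-independent alternative might be to avoid quoteFile and read the file with an explicit UTF-8 encoding, as the second experiment in the question does. The following is a minimal sketch, not part of template-haskell: readFileUtf8 and quoteFileUtf8 are names made up for this example, and the structure imitates what quoteFile does.

```haskell
import Control.Exception (evaluate)
import Language.Haskell.TH.Quote (QuasiQuoter(..))
import Language.Haskell.TH.Syntax (Q, addDependentFile, runIO)
import System.IO (IOMode(ReadMode), hGetContents, hSetEncoding, utf8, withFile)

-- Read a whole file as UTF-8, regardless of the current locale.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h utf8
    contents <- hGetContents h
    -- hGetContents is lazy: force the whole string before withFile closes h.
    _ <- evaluate (length contents)
    return contents

-- Hypothetical stand-in for quoteFile: the quoted string is treated as a
-- file name, and the file contents (decoded as UTF-8) go to the wrapped quoter.
quoteFileUtf8 :: QuasiQuoter -> QuasiQuoter
quoteFileUtf8 (QuasiQuoter { quoteExp = qe, quotePat = qp, quoteType = qt, quoteDec = qd }) =
  QuasiQuoter { quoteExp = get qe, quotePat = get qp, quoteType = get qt, quoteDec = get qd }
  where
    get :: (String -> Q a) -> FilePath -> Q a
    get quoter fileName = do
      contents <- runIO (readFileUtf8 fileName)
      addDependentFile fileName  -- ask GHC to recompile when the file changes
      quoter contents
```

With this in place, the definition from the question would become stFile = quoteFileUtf8 st. As with quoteFile, the quasi-quoter has to be defined in a different module from the one that uses it.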
