djhurio - 2 months ago
R Question

Read a UTF-8 text file with BOM

I have a text file with Byte order mark (U+FEFF) at the beginning. I am trying to read the file in R. Is it possible to avoid the Byte order mark?

The function fread (from the data.table package) reads the file, but adds ļ»æ at the beginning of the first variable name:

> names(frame_pers)[1]
[1] "ļ»æreg_date"


The same happens with the read.csv function.

For now I have written a function that removes the BOM from the first column name, but I believe there should be a way to strip the BOM automatically.

remove.BOM <- function(x) setnames(x, 1, substring(names(x)[1], 4))

> names(frame_pers)[1]
[1] "ļ»æreg_date"
> remove.BOM(frame_pers)
> names(frame_pers)[1]
[1] "reg_date"
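
To confirm that the stray characters really come from a UTF-8 BOM, the first three bytes of the file can be compared with EF BB BF. The following is only a minimal sketch: has_utf8_bom is a hypothetical helper name, and the temporary file and its contents are made up for the demonstration.

```r
# Hypothetical helper (not from the original post): returns TRUE if the
# file starts with the UTF-8 byte order mark, i.e. the bytes EF BB BF.
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Demo data (assumed): a small CSV written with a leading BOM.
bom_file <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("reg_date,value\n2014-01-01,1\n")),
         bom_file)

# The same content without a BOM, for comparison.
plain_file <- tempfile(fileext = ".csv")
writeBin(charToRaw("reg_date,value\n2014-01-01,1\n"), plain_file)

has_utf8_bom(bom_file)
has_utf8_bom(plain_file)
```

Checking bytes rather than characters avoids depending on how the session's native encoding happens to display the BOM.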


I am using the native encoding for the R session:

> options("encoding" = "")
> options("encoding")
$encoding
[1] ""

Answer

Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? The help page ?file says:

As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications).
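
A small self-contained sketch of this suggestion follows. The temporary file and its two columns are assumptions made up for the demonstration; only the fileEncoding = "UTF-8-BOM" argument comes from the documentation quoted above.

```r
# Demo data (assumed): write a small CSV that starts with a UTF-8 BOM
# (bytes EF BB BF) to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("reg_date,value\n2014-01-01,1\n")),
         tmp)

# fileEncoding = "UTF-8-BOM" asks the file connection to drop a leading
# BOM if one is present, so the first column name is read without it.
frame_pers <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(frame_pers)[1]

unlink(tmp)
```

With this encoding the first column name should come back as plain reg_date, so a helper like remove.BOM is no longer needed for read.csv.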