MichaelChirico MichaelChirico - 29 days ago 6
R Question

Dealing with ambiguous column names in data.table

I'm reading data from an Excel file into a data.table in R. The file is formatted like so:

COL_1_STUFF COL_2_STUFF COL_3_STUFF
ID EST MOE PCT EST MOE PCT EST MOE PCT


That is, for each variable (the
COL
s), there is an estimate, a margin of error, and a percentage given.

The trouble is being created by
read.xlsx2
, which I'm using to import the file like so:

data <- as.data.table(read.xlsx2(
"file.xlsx", sheetIndex = 1L, colIndex = c(1L, 4L, 7L), startRow = 2L))


The problem is that
read.xlsx2
assigns the same column name to a bunch of stuff--the import looks something like:

ID EST EST EST


even if I set
header = FALSE
, I'm apt to get something like

X1 X2 X2 X2


To circumvent this, I've done the following subsequent to import:

data[ , c("col1_est", "EST") := .(EST, NULL)]
data[ , c("col2_est", "EST") := .(EST, NULL)]
data[ , c("col3_est", "EST") := .(EST, NULL)]


This strikes me as an odd way to deal with the problem; can anyone suggest an alternate approach to this?

Answer

A perfect solution to this was covered by an update a week ago which adds a check.names argument to data.table (and to fread):

setDT(read.xlsx2("file.xlsx", sheetIndex = 1L,
                      colIndex = c(1L, 4L, 7L), startRow = 2L),
           check.names = TRUE)

Automatically adds .1, .2, etc. to duplicated column names.

Thanks to @DavidArenburg for original FR #1027 and to @Arun for implementation for data.table; didn't take much for me to port that to setDT so it could be done by reference.

Comments