JaimeCr JaimeCr - 3 months ago 8
R Question

Error splitting rows based on column regex

I'm getting a strange error splitting rows of a data.table based on the value of one of the columns. If I create the following data.table:

example <- data.table(dt=c("2016-08-18 16:04:37.0", "2016-08-11 18:01:44.0", "2016-08-11 18:01:44.0"),
latitude = c(58.64347, 59.73744, 59.73744),
countrycode = c("SE", "SE", "SE"),
city = c("Mariestad", "Norrtälje", "Norrtälje"),
street = c("E20", "E18", "E18"),
streetnr = c(NA, NA, NA),
postalcode = c(54274, 76192, 76192))


And I try to split the first two rows based on the value of the column
street
(I'm trying to find patterns like E18/E20/E21), I can execute this:

example[1:2, strsplit(street, "/", fixed = T),by = "countrycode,city,streetnr,postalcode"]


And everything is fine, it doesn't split the rows because it doesn't find the "/".

countrycode city streetnr postalcode V1
1: SE Mariestad NA 54274 E20
2: SE Norrtälje NA 76192 E18


But if I try to split the last row:

example[2:3, strsplit(street, "/", fixed = T), by = "countrycode,city,streetnr,postalcode"]


I get this:

countrycode city streetnr postalcode V1 V2
1: SE Norrtälje NA 76192 E18 E18


As you can see, It creates two new columns (V1 and V2) with the same value ("E18"), whereas in the previous command, it just created one new column (V1). If I try to split just the first row (with a streetcode of "E20" instead of "E18"), it works without any issue:

example[1:1, strsplit(street, "/", fixed = T), by = "countrycode,city,streetnr,postalcode"]

countrycode city streetnr postalcode V1
1: SE Mariestad NA 54274 E20


Does anybody know why this is happening? Could be related to the character codification or something like that?

Any help would be greatly appreciated.

Answer

strsplit returns a list with length equal to your input vector. When DT[, j, by] sees a list in j, it interprets it as a list of columns. If you want everything in a single column, you can unlist:

example[, 
  .(ssplit = unlist(strsplit(street, "/", fixed = TRUE)))
, by = "countrycode,city,streetnr,postalcode"]

Doing by= here is very inefficient and requires enumerating all the col names. Possibly better:

s = strsplit(example$street, "/", fixed=TRUE)
example[rep(1:.N, lengths(s)), c(.SD, .(ssplit = s))]


                      dt latitude countrycode      city street streetnr postalcode ssplit
1: 2016-08-18 16:04:37.0 58.64347          SE Mariestad    E20       NA      54274    E20
2: 2016-08-11 18:01:44.0 59.73744          SE Norrtälje    E18       NA      76192    E18
3: 2016-08-11 18:01:44.0 59.73744          SE Norrtälje    E18       NA      76192    E18