Ben Ben - 11 months ago 52
R Question

How to speed up this string split operation within data.table

I have a data.table with some IDs pasted together as a single character column, separated by underscores. I'm trying to split the ids into separate columns, but my best method is really slow for my large dataset (~ 250M rows). Interestingly, the operation doesn't appear to take O(N) time which is what I'd expect. In other words, it's pretty fast until about 50M+ rows and then gets really slow.

Make some data

sim_rows <- 40000000
dt <- data.table(
LineId = rep("L0123", times=sim_rows),
StationId = rep("S0123", times=sim_rows),
TimeId = rep("T0123", times=sim_rows)
dt[, InfoId := paste(LineId, StationId, TimeId, sep="_")]
dt[, c("LineId", "StationId", "TimeId") := NULL]

1: L0123_S0123_T0123
2: L0123_S0123_T0123
3: L0123_S0123_T0123
4: L0123_S0123_T0123
5: L0123_S0123_T0123
39999996: L0123_S0123_T0123
39999997: L0123_S0123_T0123
39999998: L0123_S0123_T0123
39999999: L0123_S0123_T0123
40000000: L0123_S0123_T0123

Check timings

system.time( dt[1:10000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
user system elapsed
5.179 0.634 3.867

system.time( dt[1:20000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
user system elapsed
7.805 0.958 7.703

system.time( dt[1:30000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
user system elapsed
12.556 1.782 12.349

system.time( dt[1:40000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
user system elapsed
29.260 2.822 29.895

Check result

InfoId LineId StationId TimeId
1: L0123_S0123_T0123 L0123 S0123 T0123
2: L0123_S0123_T0123 L0123 S0123 T0123
3: L0123_S0123_T0123 L0123 S0123 T0123
4: L0123_S0123_T0123 L0123 S0123 T0123
5: L0123_S0123_T0123 L0123 S0123 T0123
39999996: L0123_S0123_T0123 L0123 S0123 T0123
39999997: L0123_S0123_T0123 L0123 S0123 T0123
39999998: L0123_S0123_T0123 L0123 S0123 T0123
39999999: L0123_S0123_T0123 L0123 S0123 T0123
40000000: L0123_S0123_T0123 L0123 S0123 T0123

How can I speed this baby up?

Answer Source

stringr is newer, based on stringi internally and is often even faster.

Also, both stringi and to a lesser extent stringr have multiple variants of each string operation (fixed/coll/regex/words/boundaries/charclass) which are optimized for the type of operand.

Try str_split(..., '_'), it should be very fast.

> system.time( dt[1:1e6, c("LineId", "StationId", "TimeId") := str_split(InfoId, "_")] )
   user  system elapsed 
  1.063   0.014   1.077  # on my old machine; please tell us your numbers?