Hoap Humanoid - 6 months ago
R Question

Faster way to trim a long character vector in R

I have a big data set with around 500,000 rows, each of which is a string. I would like to trim every row to a fixed length.

I found this:

dt$rev <- strtrim(dt$rev, width=max_len)

However, it takes too long. Is there a faster way?


This has nothing to do with data.table. It's just that strtrim() is fairly slow.

As long as you're operating on single-width characters (i.e., not double-width ones such as Chinese, Japanese, or Korean characters), you can instead use substr(), which is much faster: roughly 100 times faster in the timings below.
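
Applied to the column from the question, the swap is a one-line change. Here is a minimal sketch, assuming dt is a data frame (or data.table) with a character column rev; the sample rows and the max_len value are made up for illustration:

```r
## Hypothetical stand-in for the asker's data: a character column named rev
dt <- data.frame(rev = c("a fairly long review text", "short", NA),
                 stringsAsFactors = FALSE)
max_len <- 10  # assumed cutoff, not taken from the question

## Keep characters 1..max_len of every element
dt$rev <- substr(dt$rev, 1, max_len)
```

Strings shorter than max_len come back unchanged, and NA stays NA.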

## Make a long character vector with 5 million elements
x <- rep(state.name, 1e5)

## Speed comparison
system.time(substr(x, 1, 3))
#   user  system elapsed 
#   0.43    0.00    0.44 
system.time(strtrim(x, 3))
#   user  system elapsed 
#  44.63    0.03   44.85

## Confirm that both methods return the same output
identical(substr(state.name,1,3), strtrim(state.name,3))
# [1] TRUE
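
The identical() check passes because every state name consists of single-width characters. With double-width characters the two functions can diverge, since strtrim() counts display columns while substr() counts characters. Here is a small sketch; the sample string is only an illustration, and the reported widths depend on the locale R is running in, so no output is shown:

```r
## Three CJK characters, written as escapes so the code is encoding-safe
s <- "\u65e5\u672c\u8a9e"

by_chars <- substr(s, 1, 2)   # the first 2 characters
by_width <- strtrim(s, 2)     # as many characters as fit in 2 display columns

## Compare character counts and display widths of the two results
nchar(by_chars, type = "chars")
nchar(by_chars, type = "width")
nchar(by_width, type = "width")
identical(by_chars, by_width)
```

If the last line returns FALSE on your data, substr() is not a drop-in replacement, and you need to decide whether the limit should apply to characters or to display width.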