Daniel Falbel - 3 months ago
R Question

stringr::str_sub output is unexpected

Consider the following data.frame:

df <- structure(list(sufix = c("atizado", "atoria", "atório", "auta",
"áutico", "ável"), min_stem_len = c(4, 5, 3, 5, 4, 2), replacement = c("",
"", "", "", "", ""), exceptions = list(character(0), character(0),
character(0), character(0), character(0), c("afável", "razoável",
"potável", "vulnerável"))), .Names = c("sufix", "min_stem_len",
"replacement", "exceptions"), row.names = 21:26, class = c("tbl_df",
"tbl", "data.frame"))


I have a vector of strings in the sufix column of this data.frame. Now I have a word, word <- "amável", and I want to get the suffix of this word with the same length as each element of df$sufix.

I'm using the following code:

library(stringr)
word <- "amável"
str_sub(word, start = -stringr::str_length(df$sufix))


But the output is:

> str_sub(word, start = -stringr::str_length(df$sufix))
[1] "amável" "mável" "mável" "vel" "mável" "vel"
> df$sufix
[1] "atizado" "atoria" "atório" "auta" "áutico" "ável"


I was expecting the last element of the resulting vector to be "ável", since:

> str_length("ável")
[1] 4
> str_sub(word, start = -4)
[1] "ável"





Here is a simpler reproducible example:

set.seed(100)
a <- sample(1:10, 10000, replace = TRUE)
res <- rep("ábc", 10000) %>% str_sub(start = -a)
sum(ifelse(a > 3, 3, a) != str_length(res))
[1] 2504

Answer

Note that every result except the first one is wrong.

They should have been

[1] "amável" "amável" "amável" "ável"   "amável" "ável" 
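
The expected vector can be cross-checked in base R, without going through stringr or stringi at all. The sketch below uses a hypothetical helper, last_chars (my own name, not a function from either package), and hard-codes the six suffix lengths from df$sufix:

```r
# Hypothetical base-R helper (not part of stringr or stringi):
# return the last n characters of x, or all of x when n >= nchar(x).
last_chars <- function(x, n) {
  len <- nchar(x, type = "chars")
  # substring() treats a start position below 1 as 1
  substring(x, len - n + 1, len)
}

# "amável", written with a Unicode escape so the result does not
# depend on the encoding of the script file
word <- "am\u00e1vel"

# Lengths of the six suffixes in df$sufix
suffix_len <- c(7, 6, 6, 4, 6, 4)

expected <- last_chars(word, suffix_len)
print(expected)
```

Because the word itself has only six characters, any requested length of six or more simply returns the whole word.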

This can be worked around by repeating the word explicitly, so that it has the same length as the vector of offsets:

library(stringi)
stri_sub(rep(word, 6), from = -stri_length(df$sufix))

The same trick should work with your stringr code: pass rep(word, 6) to str_sub instead of word.

### EDIT TO ADD ###

I now understand what you mean. There is definitely some strange behavior here, most likely related to the special character á. See the example below:

df <- data.frame(suffix = c("Lorem","ipsum","dolor","sit","amet","consectetur","adipiscing", "elit","Donec","arcu")) 
df$len <- stri_length(df$suffix)

Then look at the strange behavior in the 7th element of the result:

stri_sub("amavel", from = -df$len)
##  [1] "mavel"  "mavel"  "mavel"  "vel"    "avel"   "amavel" "amavel" "avel"  
##  [9] "mavel"  "avel" 

# Compared to
stri_sub("amável", from = -df$len)
##  [1] "mável"  "mável"  "mável"  "vel"    "ável"   "amável" "mável"  "ável"  
##  [9] "mável"  "ável"

Oddly enough, the result in the last case is correct if rep is used:

stri_sub(rep("amável", 10), from = -df$len)
##  [1] "mável"  "mável"  "mável"  "vel"    "ável"   "amável" "amável" "ável"  
##  [9] "mável"  "ável"

# note how the 7th element is now correct.

So even though it's a bit hacky, the solution provided above should work.

I tried looking at the code of stri_sub, which calls C_stri_sub, but that was a dead end for me. Perhaps somebody more knowledgeable about C and/or string manipulation can lend a hand?

### SECOND EDIT ###

It seems to me the problem lies in how the single string is recycled inside the call to stri_sub. Compare this alternative to the code in your edit:

set.seed(100)
a <- sample(1:10, 10000, replace = TRUE)
res <- stri_sub(rep("ábc", 10000), from = -a)
sum(ifelse(a > 3, 3, a) != stri_length(res))
## [1] 0
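
For reference, the same check can be written in base R only, which gives a baseline that does not depend on how stringr or stringi recycle their arguments. This is a sketch under the assumption that "the last a[i] characters" is the intended meaning of start = -a[i]; the variable names are mine:

```r
# Base-R reference for the check above (no stringr/stringi involved).
set.seed(100)
a <- sample(1:10, 10000, replace = TRUE)

x <- "\u00e1bc"                      # "ábc", three characters
len <- nchar(x, type = "chars")

# Last a[i] characters of x; substring() clamps a start below 1 to 1
res <- substring(x, len - a + 1, len)

# A suffix can never be longer than the string itself,
# so the expected length is min(a[i], 3)
mismatches <- sum(pmin(a, len) != nchar(res, type = "chars"))
print(mismatches)
```

Here pmin(a, len) plays the role of ifelse(a > 3, 3, a) in the original check, and the count of mismatches should be zero.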