Semyon Tamara - 10 months ago 49
R Question

# How to correctly extract a numeric component from complex strings in a data frame and substitute the strings with extraction output?

I have a data.frame with two variables of string expressions like "ABC`w/XYZ 8", where w = any number from 1 to 999. What I need to do is to extract w and replace the whole string with it. I use this code:

``````
df <- data.frame(a = c("ABC`5/XYZ 8", "A`25/BHU 19", "ach`246/chy 0"), b = c("sfse`3/cjd 65", "jlke`234/Chu 19", "h`45/hy 0"))

df$a <- sapply(df$a, function(x) {substr(df$a[x], regexpr("`[0-9]+/", df$a[x]) + 1,
  regexpr("`[0-9]+/", df$a[x]) + attr(regexpr("`[0-9]+/", df$a[x]), "match.length") - 2)})
``````

It works, but instead of a = c(5, 25, 246) I get a = c(25, 5, 246). I guess this happens because a is of class factor. However, when a is of class character I get NAs as output.
Is there a way to preserve the order of a, or to use sapply and substr on a character vector?
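
The reordering can be reproduced in isolation. The sketch below is only an illustration: it sets the factor levels explicitly (an assumption, so the result does not depend on locale sort order) and shows that indexing with a factor uses its integer codes, while indexing an unnamed character vector by string yields NA.

```r
# Levels are set explicitly here (assumption) so the codes are 2, 1, 3 in any locale
a <- factor(c("ABC`5/XYZ 8", "A`25/BHU 19", "ach`246/chy 0"),
            levels = c("A`25/BHU 19", "ABC`5/XYZ 8", "ach`246/chy 0"))
as.integer(a)  # integer codes behind the factor

# a[x] with a factor x indexes by code, so element 1 (code 2) picks a[2], and so on
picked <- sapply(a, function(x) as.character(a[x]))

# With a character vector, b[x] looks up an element *named* x; none exist, so NA
b <- as.character(a)
b_picked <- sapply(b, function(x) b[x])
```

Here `picked` comes back with the first two elements swapped, which matches the c(25, 5, 246) result, and `b_picked` contains only NA.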

We can use `sub` to extract the number in the 'w' position of the string. Match one or more letters or backticks (`` [A-Za-z`]+ ``), capture the one or more digits that follow as a group (`(\\d+)`), match the remaining characters (`.*`), and replace the whole match with the backreference to the capture group (`\\1`).
``````
as.numeric(sub("[A-Za-z`]+(\\d+).*", "\\1", df$a))
``````
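
Since `sub` is vectorised, no `sapply` or element indexing is needed, so the row order is preserved whether the column is a factor or character. A minimal sketch applying the same pattern to both columns of the question's data (the `lapply` wrapper and `stringsAsFactors = FALSE` are illustrative choices, not part of the original answer):

```r
# Same data as in the question; columns kept as character
df <- data.frame(
  a = c("ABC`5/XYZ 8", "A`25/BHU 19", "ach`246/chy 0"),
  b = c("sfse`3/cjd 65", "jlke`234/Chu 19", "h`45/hy 0"),
  stringsAsFactors = FALSE
)

# Convert every column in one pass; df[] keeps the data.frame structure
df[] <- lapply(df, function(col) {
  as.numeric(sub("[A-Za-z`]+(\\d+).*", "\\1", col))
})

df$a  # 5 25 246
df$b  # 3 234 45
```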
Or another option is `str_extract`
``````
library(stringr)