Semyon Tamara - 1 month ago
R Question

How to correctly extract a numeric component from complex strings in a data frame and substitute the strings with extraction output?

I have a data.frame with two variables of string expressions like "ABC`w/XYZ 8", where w is any number from 1 to 999. What I need to do is extract w and replace the whole string with it. I use this code:

df <- data.frame(a = c("ABC`5/XYZ 8", "A`25/BHU 19", "ach`246/chy 0"), b = c("sfse`3/cjd 65", "jlke`234/Chu 19", "h`45/hy 0"))

df$a <- sapply(df$a, function(x) {substr(df$a[x], regexpr("`[0-9]+/", df$a[x]) +1,
+ regexpr("`[0-9]+/", df$a[x]) + attr(regexpr("`[0-9]+/", df$a[x]), "match.length")-2)})


It works, but instead of a = c(5, 25, 246) I get a = c(25, 5, 246). I guess this happens because of the factor class of a. However, when a is of class character, I get NAs as output.
Is there a way to preserve the order of a, or to use sapply and substr on a character vector?

Answer

We can use sub to extract the number in the 'w' position of the string. Match one or more letters or backticks ([A-Za-z`]+), capture the one or more digits that follow as a group ((\\d+)), match the remaining characters (.*), and replace the whole match with the backreference to the capture group (\\1).

as.numeric(sub("[A-Za-z`]+(\\d+).*", "\\1", df$a))
#[1]   5  25 246
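
Since the question mentions two variables, the same sub call can be applied to every column at once. The following is a minimal sketch, assuming all columns follow the same "letters`w/..." layout as a and b; stringsAsFactors = FALSE is set only so the example behaves the same on older and newer R versions.

```r
# Same data as in the question, kept as character columns
df <- data.frame(
  a = c("ABC`5/XYZ 8", "A`25/BHU 19", "ach`246/chy 0"),
  b = c("sfse`3/cjd 65", "jlke`234/Chu 19", "h`45/hy 0"),
  stringsAsFactors = FALSE
)

# Replace each column with the number captured after the backtick;
# df[] <- keeps the data.frame structure and the row order
df[] <- lapply(df, function(col) {
  as.numeric(sub("[A-Za-z`]+(\\d+).*", "\\1", col))
})

df
#    a   b
#1   5   3
#2  25 234
#3 246  45
```

Because sub is vectorised, no per-element indexing is involved, so the rows stay in their original order.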

Another option is str_extract from the stringr package, which returns the first run of digits in each string

library(stringr)
as.numeric(str_extract(df$a, "\\d+"))
#[1]   5  25 246
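
As for why the original code reorders the values: sapply passes each element of df$a as x, and df$a[x] then uses that element as an index. For a factor, the index is the integer level code (levels are sorted, so the order changes); for a character vector, the index is treated as a name, which does not exist, hence the NAs. If we want to keep the regexpr and substr approach from the question, a vectorised version avoids the indexing altogether. This is a sketch on column a only, with the intermediate names s, m and w chosen for illustration.

```r
# Column a from the question, deliberately stored as a factor
a <- factor(c("ABC`5/XYZ 8", "A`25/BHU 19", "ach`246/chy 0"))

# Work on the character values, not on the factor codes
s <- as.character(a)

# Position and length of the "`digits/" match in each string
m <- regexpr("`[0-9]+/", s)

# Skip the leading backtick and drop the trailing slash
w <- as.numeric(substr(s, m + 1, m + attr(m, "match.length") - 2))
w
#[1]   5  25 246
```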