pill45 - 3 months ago
R Question

In R, how can I modify a variable in a data frame using a regular expression?

This is the dataset

df1 <- data.frame("id" = c("ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-503.100044",
                           "ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-783.100435",
                           "ebi.ac.uk:MIAMExpress:Reporter:C-DEA-783.100435"),
                  "Name" = c("ABC", "DEF", ""))


Printing the dataset gives

                                                id Name
1 ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-503.100044  ABC
2 ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-783.100435  DEF
3  ebi.ac.uk:MIAMExpress:Reporter:C-DEA-783.100435


I want to keep only the digits after the last dot in id, and turn the empty Name into NA, so that the data frame looks like this

      id Name
1 100044  ABC
2 100435  DEF
3 100435   NA


Can anyone show me how to approach this problem?

Answer

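Regex way to find the position of the last dot:

df1$id <- as.character(df1$id)
regexpr("\\.[^.]*$", df1$id)

or

sapply(gregexpr("\\.", df1$id), tail, 1)

Both calls return the position of the last dot in each string, not the text that follows it.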
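The position of the last dot still has to be turned into the trailing number. A minimal sketch of two ways to do that in base R, assuming the ids are plain character strings; the names id, pos, suffix_by_pos and suffix_by_sub are illustrative only and are not part of the original answer:

```r
# Sample ids copied from the question
id <- c("ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-503.100044",
        "ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-783.100435",
        "ebi.ac.uk:MIAMExpress:Reporter:C-DEA-783.100435")

# regexpr() returns the start of the match, i.e. the position of the last dot
pos <- regexpr("\\.[^.]*$", id)

# Keep everything after that position
suffix_by_pos <- substring(id, pos + 1)

# One-step alternative: the greedy ".*" deletes everything up to the last dot
suffix_by_sub <- sub(".*\\.", "", id)

print(suffix_by_pos)
print(suffix_by_sub)
```

Both variants return the input unchanged when a string contains no dot, so it may be worth checking for that case if the ids are not guaranteed to follow this pattern.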

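An easier-to-remember way, splitting on the dots and keeping the last piece:

# factor -> character (only needed if the strings were stored as factors)
df1$id <- as.character(df1$id)

# split each id on literal dots and keep the last piece
df1$id <- sapply(strsplit(df1$id, split = "\\."), tail, 1)
# treat empty names as missing
df1$Name[df1$Name == ""] <- NA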

df1
      id Name
1 100044  ABC
2 100435  DEF
3 100435 <NA>

sapply(strsplit(df1$id,split="\\."),tail,1) is from here.
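The split pattern "\\." is itself a regular expression. If a truly regex-free version is preferred, strsplit() accepts fixed = TRUE. Here is a rough sketch of the same idea wrapped in a helper; last_field is a hypothetical name introduced only for this illustration, and vapply() is swapped in for sapply() so that the result is always a character vector:

```r
# Hypothetical helper: return the text after the last occurrence of `sep`
last_field <- function(x, sep = ".") {
  # fixed = TRUE treats `sep` as a literal string, not a regex
  parts <- strsplit(as.character(x), split = sep, fixed = TRUE)
  # vapply() insists on exactly one string per element
  vapply(parts, function(p) tail(p, 1), character(1))
}

# Data frame from the question, with strings kept as character
df1 <- data.frame(
  id = c("ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-503.100044",
         "ebi.ac.uk:MIAMExpress:Reporter:A-MEXP-783.100435",
         "ebi.ac.uk:MIAMExpress:Reporter:C-DEA-783.100435"),
  Name = c("ABC", "DEF", ""),
  stringsAsFactors = FALSE
)

df1$id <- last_field(df1$id)
df1$Name[df1$Name == ""] <- NA  # empty names become missing
print(df1)
```

With vapply(), an input that does not yield exactly one piece (an empty string, for example) raises an error instead of silently producing a list, which is usually easier to debug.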