sweetmusicality - 1 month ago
R Question

Extracting text based on condition in R

I am relatively new to R. I have a character variable named RN whose text needs to be extracted into two variables, named_RN and general_RN, based on some conditions on RN. This is the desired result (right now, named_RN and general_RN are blank; I don't know how to code this part, and that is what I need help with):

RN                                      named_RN     general_RN
RP4A60D26L (Pentazocine)                Pentazocine
0 (Complement C4)                                    Complement C4
0 (Aminocap) U6206 (Amino)              Amino        Aminocap
N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)  Amiodarone   Benzo, Ferri


As you can see, I am trying to extract the information within the parentheses. However, I want to extract from RN into general_RN if it has a code of 0, and into named_RN if it has a non-zero code.

The main problem I am running into is that I cannot gsub by "0 (" or " 0 (" (with a space before the 0 in the latter, because sometimes the 0 code starts in the middle of the text in RN, as in the last row), because some of the codes for named_RN end with "0 (", as is also the case in the last row.
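
To make the pitfall concrete, here is a minimal sketch using the last row of the example. The variable names (rn, naive, anchored) are illustrative and not part of the original post, and the anchored pattern is just one possible way to require a standalone 0.

```r
rn <- "N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)"

# Naive pattern: it also matches the tail of the named code "N3R30 (".
naive <- regmatches(rn, gregexpr("0 \\(", rn))[[1]]
length(naive)     # 3

# Require the 0 to sit at the start of the string or right after a space,
# so the "0 (" inside "N3R30 (" is skipped.
anchored <- regmatches(rn, gregexpr("(^| )0 \\(", rn, perl = TRUE))[[1]]
length(anchored)  # 2
```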

Please advise.

Thank you!

Answer

Here's one way to do it. Basically, I create a temporary column where matches are easier to detect, then extract the text inside the parentheses with regmatches.

df <- read.table(text="RN
'RP4A60D26L (Pentazocine)'
'0 (Complement C4)'
'0 (Aminocap) U6206 (Amino)'
'N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)'",header=TRUE,stringsAsFactors=FALSE)

df$RN_temp <- gsub("^[0] "," general_RN",df$RN) #replace leading 0s w/ general_RN
df$RN_temp <- gsub(" [0] "," general_RN",df$RN_temp) #replace other " 0 "
df$RN_temp <- gsub(" \\("," named_RN(",df$RN_temp) #replace rest w/ named_RN
df$RN_temp

df$named_RN <- regmatches(df$RN_temp,gregexpr("(?<=named_RN\\().*?(?=\\))",
                df$RN_temp, perl=TRUE))
df$general_RN <- regmatches(df$RN_temp,gregexpr("(?<=general_RN\\().*?(?=\\))", 
                  df$RN_temp, perl=TRUE))
df$RN_temp <- NULL
df
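
It may help to see what regmatches combined with gregexpr returns at this point: a list with one character vector per row, where a row without any match holds character(0) rather than NULL. A small sketch follows; the two input strings are made up for illustration and mimic the RN_temp format.

```r
x <- c("U6206 named_RN(Amino)", " general_RN(Benzo)")

# Same lookbehind/lookahead pattern as above, applied to the toy input.
m <- regmatches(x, gregexpr("(?<=named_RN\\().*?(?=\\))", x, perl = TRUE))

m[[1]]           # "Amino"
length(m[[2]])   # 0: no match gives character(0)
is.null(m[[2]])  # FALSE
```

Because each element can hold zero, one, or several strings, the columns created above are list columns.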

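EDIT: To turn the list columns into plain character columns, collapse each row's matches into a single string and use NA where nothing matched. Rows without a match hold character(0), not NULL, and a bare unlist() would misalign the rows here, because row 4 has two general_RN values while row 1 has none.

to_chr <- function(x) if (length(x) == 0) NA_character_ else paste(x, collapse = ", ") # NA when nothing matched
df$named_RN <- vapply(df$named_RN, to_chr, character(1))
df$general_RN <- vapply(df$general_RN, to_chr, character(1))
str(df)

'data.frame':   4 obs. of  3 variables:
 $ RN        : chr  "RP4A60D26L (Pentazocine)" "0 (Complement C4)" "0 (Aminocap) U6206 (Amino)" "N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)"
 $ named_RN  : chr  "Pentazocine" NA "Amino" "Amiodarone"
 $ general_RN: chr  NA "Complement C4" "Aminocap" "Benzo, Ferri"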
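
For reference, printing df before the EDIT step (while named_RN and general_RN are still list columns) gives:

                                      RN    named_RN    general_RN
1               RP4A60D26L (Pentazocine) Pentazocine              
2                      0 (Complement C4)             Complement C4
3             0 (Aminocap) U6206 (Amino)       Amino      Aminocap
4 N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)  Amiodarone  Benzo, Ferri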
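
As an alternative to the temporary column, the same split could be done in one pass by capturing each "code (name)" pair and sorting the names by whether the code is exactly "0". This is only a sketch under the assumption that every entry has the form code, space, parenthesised name; split_rn, collapse, and df2 are hypothetical names introduced here, not part of the original answer.

```r
# Hypothetical helper: split one RN string into named and general entries.
split_rn <- function(rn) {
  # Each match is one "<code> (<name>)" pair.
  m <- regmatches(rn, gregexpr("[^ ()]+ \\([^)]*\\)", rn, perl = TRUE))[[1]]
  codes <- sub(" \\(.*$", "", m)                      # text before " ("
  labels <- sub("^[^(]*\\(([^)]*)\\)$", "\\1", m)     # text inside "( )"
  # Join several names with ", "; use NA when there are none.
  collapse <- function(x) {
    if (length(x) == 0) NA_character_ else paste(x, collapse = ", ")
  }
  c(named_RN = collapse(labels[codes != "0"]),
    general_RN = collapse(labels[codes == "0"]))
}

RN <- c("RP4A60D26L (Pentazocine)",
        "0 (Complement C4)",
        "0 (Aminocap) U6206 (Amino)",
        "N3R30 (Amiodarone) 0 (Benzo) 0 (Ferri)")

# vapply returns a 2 x 4 matrix; transpose it so each RN is one row.
res <- t(vapply(RN, split_rn, character(2), USE.NAMES = FALSE))
df2 <- data.frame(RN = RN, res, stringsAsFactors = FALSE)
df2
```

Comparing the code as a whole token (codes == "0") avoids the problem of codes such as N3R30 that merely end in 0.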