ryry ryry - 3 months ago 8
R Question

How can you extract strings from a merged column based off of previously extracted strings in separate columns in R?

I have been parsing a single merged column containing multiple categories of information into separated columns for the unique categories. I have been using stringr for identifying category patterns that need to be separated out of the single column.

I was able to separate all category information that contained unique, repetitive, and identifiable patterns, but I am now left with the task of extracting category information that follows no obvious extractable pattern.

Here is a basic example setup:

col1 <- c("a1 b1 apple c1","a2 b2 fruit c2","a3 b3 bunny(1) c3","a4 b4 x5 c4")
col2 <- c("b1","b2","b3","b4")
col3 <- c("a1","a2","a3","a4")
col4 <- c("c1","c2","c3","c4")

dat <- data.frame(col1,col2,col3,col4)


So now I need to extract the term in the third position of dat$col1.

>dat$col5
apple
fruit
bunny(1)
x5


I would like to do this based on the previously extracted strings in dat[,c(2,3,4)]. This is a very basic example, and I am trying to find the most reliable method for extracting data that may not always fall in the same position of dat$col1.

Answer

One way to do it is to use Map with setdiff, which captures the words of col1 that are not found in col2, col3, or col4.

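# Split col1 into words, split the row-wise paste of the other columns into words,
# then keep, for each row, the words of col1 that appear in none of the other columns.
v1 <- unlist(Map(setdiff, strsplit(dat$col1, ' '), 
                          strsplit(apply(dat[,-1], 1, paste, collapse = ' '), ' ')))
# Drop the empty strings that strsplit produces when two spaces are adjacent.
dat$col5 <- v1[v1 != '']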
dat
#               col1 col2 col3 col4     col5
#1    a1 b1 apple c1   b1   a1   c1    apple
#2    a2 b2 fruit c2   b2   a2   c2    fruit
#3 a3 b3 bunny(1) c3   b3   a3   c3 bunny(1)
#4      a4 b4 x5  c4   b4   a4   c4       x5

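Note that your variables need to be characters (not factors) for this to work, because strsplit does not accept factors. In R versions before 4.0.0, data.frame converts strings to factors by default, so either create the data frame with stringsAsFactors = FALSE or convert the columns with as.character first.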
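As a variation on the same idea, here is a minimal sketch that always returns exactly one string per row, even if the leftover part of col1 spans several words. It is not the code from the answer above: it swaps setdiff for a %in% filter and splits on runs of whitespace. The example data are assumed to match the question, except that the fourth row uses the double space visible in the printed output; the intermediate names tokens and known are made up for illustration.

```r
# Rebuild the example data, keeping every column as character (not factor).
# Assumption: row 4 contains a double space, as in the printed output above.
dat <- data.frame(
  col1 = c("a1 b1 apple c1", "a2 b2 fruit c2", "a3 b3 bunny(1) c3", "a4 b4 x5  c4"),
  col2 = c("b1", "b2", "b3", "b4"),
  col3 = c("a1", "a2", "a3", "a4"),
  col4 = c("c1", "c2", "c3", "c4"),
  stringsAsFactors = FALSE
)

# Split col1 on runs of whitespace, so a double space yields no empty token.
tokens <- strsplit(dat$col1, "\\s+")

# Collect the already-extracted values of each row as a character vector.
known <- lapply(seq_len(nrow(dat)), function(i) {
  unlist(dat[i, c("col2", "col3", "col4")], use.names = FALSE)
})

# For each row, keep the tokens that are not among the known values and
# paste them back together, so every row contributes exactly one string.
dat$col5 <- mapply(
  function(tok, kn) paste(tok[!tok %in% kn], collapse = " "),
  tokens, known,
  USE.NAMES = FALSE
)

dat$col5
```

Because the leftover tokens are pasted back together per row, the length of col5 always equals the number of rows, whereas unlist on the setdiff results can return more elements than rows when a row has more than one unmatched word.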