ryry ryry - 2 months ago
R Question

How can you extract strings from a merged column based off of previously extracted strings in separate columns in R?

I have been parsing a single merged column containing multiple categories of information into separated columns for the unique categories. I have been using stringr for identifying category patterns that need to be separated out of the single column.

I was able to separate all category information that contained unique, repetitive, and identifiable patterns, but I am now left with the task of extracting category information that follows no obvious extractable pattern.

Here is a basic example setup:

col1 <- c("a1 b1 apple c1","a2 b2 fruit c2","a3 b3 bunny(1) c3","a4 b4 x5 c4")
col2 <- c("b1","b2","b3","b4")
col3 <- c("a1","a2","a3","a4")
col4 <- c("c1","c2","c3","c4")

dat <- data.frame(col1,col2,col3,col4)

So now I need to extract the term in the third position of col1.

I would like to do this based off of the previously extracted strings in col2, col3, and col4. This is a very basic example and I am trying to find the most reliable method for extracting data that may not always fall in the same position of col1.

One way to do it is to use Map with setdiff, which captures the words in col1 that are not found in col2, col3, or col4.

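# Split col1 into words, then drop every word already present in col2..col4 of the same row
v1 <- unlist(Map(setdiff, strsplit(dat$col1, ' '), 
                          strsplit(apply(dat[,-1], 1, paste, collapse = ' '), ' ')))
# Discard empty strings (produced when col1 contains consecutive spaces)
dat$col5 <- v1[v1 != '']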
#               col1 col2 col3 col4     col5
#1    a1 b1 apple c1   b1   a1   c1    apple
#2    a2 b2 fruit c2   b2   a2   c2    fruit
#3 a3 b3 bunny(1) c3   b3   a3   c3 bunny(1)
#4      a4 b4 x5  c4   b4   a4   c4       x5
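
To see why this works, it can help to trace a single row by hand. The following is a minimal sketch using the first row of the example data; the variable names words, known, and leftover are only illustrative.

```r
# Words in the merged column for row 1
words <- strsplit("a1 b1 apple c1", " ")[[1]]

# Values already extracted into col2, col3 and col4 for row 1
known <- c("b1", "a1", "c1")

# setdiff keeps the elements of words that do not occur in known
leftover <- setdiff(words, known)
print(leftover)
# [1] "apple"
```

Because setdiff compares values rather than positions, the leftover term is found wherever it sits in the string. Keep in mind that setdiff also removes duplicates from its result.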

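Note that your variables need to be character vectors (not factors) for this to work. In R versions before 4.0.0, data.frame() converts strings to factors by default unless you pass stringsAsFactors = FALSE.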
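
If the columns did come in as factors, one option is to convert them before splitting. The following is a sketch rather than a drop-in replacement: extract_leftover is a hypothetical helper name, and it assumes that words are separated by whitespace and that any leftover words in a row should be rejoined with a single space.

```r
# Example data from the question, deliberately stored as factors
dat <- data.frame(
  col1 = c("a1 b1 apple c1", "a2 b2 fruit c2", "a3 b3 bunny(1) c3", "a4 b4 x5 c4"),
  col2 = c("b1", "b2", "b3", "b4"),
  col3 = c("a1", "a2", "a3", "a4"),
  col4 = c("c1", "c2", "c3", "c4"),
  stringsAsFactors = TRUE
)

# Convert every factor column to character so strsplit() accepts it
dat[] <- lapply(dat, as.character)

# Hypothetical helper: words of `merged` that are not among `known`,
# rejoined with a space so that rows with zero or several leftovers
# still yield exactly one string each
extract_leftover <- function(merged, known) {
  words <- strsplit(merged, "\\s+")[[1]]
  paste(setdiff(words[words != ""], known), collapse = " ")
}

# Apply the helper row by row; the known values come from col2..col4
dat$col5 <- vapply(
  seq_len(nrow(dat)),
  function(i) extract_leftover(dat$col1[i], unlist(dat[i, c("col2", "col3", "col4")])),
  character(1)
)
print(dat$col5)
# [1] "apple"    "fruit"    "bunny(1)" "x5"
```

Unlike the unlist-based version, this returns exactly one string per row, so a row with no leftover word or with several of them should not break the assignment to col5 or shift later rows out of alignment.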