I am splitting a single merged column, which contains several categories of information, into separate columns, one per category. I have been using stringr to identify the category patterns that need to be pulled out of the single column.
I was able to separate every category that follows a unique, repetitive, identifiable pattern, but I am now left with extracting a category that follows no obvious extractable pattern.
Here is a basic example setup:
col1 <- c("a1 b1 apple c1","a2 b2 fruit c2","a3 b3 bunny(1) c3","a4 b4 x5 c4")
col2 <- c("b1","b2","b3","b4")
col3 <- c("a1","a2","a3","a4")
col4 <- c("c1","c2","c3","c4")
dat <- data.frame(col1,col2,col3,col4)
One way to do it is to use setdiff, which will capture the words in col1 that are not found in col2, col3, or col4.
v1 <- unlist(Map(setdiff, strsplit(dat$col1, ' '), strsplit(apply(dat[,-1], 1, paste, collapse = ' '), ' ')))
dat$col5 <- v1[v1 != '']
dat
#               col1 col2 col3 col4     col5
#1    a1 b1 apple c1   b1   a1   c1    apple
#2    a2 b2 fruit c2   b2   a2   c2    fruit
#3 a3 b3 bunny(1) c3   b3   a3   c3 bunny(1)
#4       a4 b4 x5 c4   b4   a4   c4       x5
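Note that your variables need to be characters (not factors) for this to work, because strsplit only accepts character input. In R versions before 4.0.0, data.frame() converts strings to factors by default, so either pass stringsAsFactors = FALSE when building dat or convert the columns with as.character() first.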
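For reference, here is a self-contained sketch of the same setdiff idea. It assumes the columns are built as character vectors and wraps the per-row comparison in a hypothetical helper, `leftover_words` (the name is mine for illustration, not a function from any package). Unlike the `unlist()` version above, it collapses each row's leftovers into one string, so the result should stay aligned with the rows even when a row has no leftover word or more than one.

```r
# Build the example data with character columns (not factors).
dat <- data.frame(
  col1 = c("a1 b1 apple c1", "a2 b2 fruit c2", "a3 b3 bunny(1) c3", "a4 b4 x5 c4"),
  col2 = c("b1", "b2", "b3", "b4"),
  col3 = c("a1", "a2", "a3", "a4"),
  col4 = c("c1", "c2", "c3", "c4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row, keep the words of `merged` that do not
# appear in any of the already-extracted columns in `known`.
leftover_words <- function(merged, known) {
  words_all   <- strsplit(merged, " ", fixed = TRUE)
  words_known <- strsplit(apply(known, 1, paste, collapse = " "), " ", fixed = TRUE)
  leftovers   <- Map(setdiff, words_all, words_known)
  # Collapse so every row yields exactly one string (possibly empty).
  vapply(leftovers, paste, character(1), collapse = " ")
}

dat$col5 <- leftover_words(dat$col1, dat[, c("col2", "col3", "col4")])
dat$col5
```

Keep in mind that setdiff compares whole words and drops duplicates, so a leftover word that happens to match a value in one of the known columns would be removed as well.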