rtreacy rtreacy - 1 month ago 11
R Question

Splitting character vector into data frame when the separating character is in the string

I have a dataframe of the form:

B <- data.frame(B=c(rep(" 'abcefgh.abc_123.1_123.1'",length=50),
rep(" 'ab[+12.1]abcdefgh.abc_123.1_123.1'",length=50)))


I need to split this single column into 4 columns. My first attempt was to just use a for loop and the strsplit() command to cut up each observation and paste it back together in the desired format.

Bsplit <- data.frame()
for (i in 1:nrow(B)){
  temp3 <- strsplit(as.character(B$B[i]),split='_', fixed= TRUE)
  temp4 <- strsplit(temp3[[1]][1],split='.',fixed= TRUE)
  if(is.na(temp4[[1]][3])){
    bsplit <- data.frame(a=temp4[[1]][1],b=temp4[[1]][2],c=temp3[[1]][2],d=temp3[[1]][3])
    Bsplit <- rbind(Bsplit,bsplit)
  }
  else {
    bsplit <- data.frame(a=paste(temp4[[1]][1],'.',temp4[[1]][2],sep=''),b=temp4[[1]][3],
                         c=temp3[[1]][2],d=temp3[[1]][3])
    Bsplit <- rbind(Bsplit,bsplit)
  }
}


This gives the desired result, but it is far too slow to be practical. On my second attempt I used a combination of cSplit_f() and stri_split_fixed().

library(stringi)
library(splitstackshape)

X <- cSplit_f(B,1,sep='_')
Y <- lapply(data.frame(X[[1]]),stri_split_fixed,pattern='.',simplify= TRUE)


The problem is that when a string takes the form 'ab[+12.1]abcdefgh.abc_123.1_123.1', R cuts the string like this: 'ab[+12' | '1]abcdefgh' | 'abc' | 123.1 | 123.1. How do I protect the string so that it ignores the '.' inside the brackets and returns 'ab[+12.1]abcdefgh' | 'abc' | 123.1 | 123.1?

Answer

A base R attempt which makes use of regular expression grouping:

Data:

mydf <- data.frame(B=c(rep(" 'abcefgh.abc_123.1_123.1'",length=50),
                rep(" 'ab[+12.1]abcdefgh.abc_123.1_123.1'",length=50)))

Code:

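# Replace the one "." that separates the first two fields with "_",
# then split every string on "_" and bind the pieces into rows.
new_df <- do.call(rbind, strsplit(gsub("(['\\w\\+\\.\\[]*)(\\]*)([a-z]+)(\\.)([\\w\\.']+)",
                             "\\1\\2\\3_\\5",
                             trimws(mydf$B),
                             perl = TRUE), split = "_"))
new_df <- data.frame(new_df)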

Output:

# Just a select number of rows
 X1                 X2  X3    X4    
 'abcefgh           abc 123.1 123.1'
 'abcefgh           abc 123.1 123.1'
 'abcefgh           abc 123.1 123.1'
 'abcefgh           abc 123.1 123.1'
 'abcefgh           abc 123.1 123.1'
 'abcefgh           abc 123.1 123.1'
 'ab[+12.1]abcdefgh abc 123.1 123.1'
 'ab[+12.1]abcdefgh abc 123.1 123.1'
 'ab[+12.1]abcdefgh abc 123.1 123.1'
 'ab[+12.1]abcdefgh abc 123.1 123.1'
 'ab[+12.1]abcdefgh abc 123.1 123.1'
 'ab[+12.1]abcdefgh abc 123.1 123.1'

Explanation:

The idea here is to match each string as 5 groups and use gsub to rebuild it with a single, consistent separator. I will use 'ab[+12.1]abcdefgh.abc_123.1_123.1' as an example. The pattern breaks the string into the following groups: 'ab[+12.1, ], abcdefgh, . and abc_123.1_123.1'. The replacement concatenates the groups back together, except for the fourth group (the dot), which is replaced with _. At this point all four fields are separated by _, so you can split the string on _ to generate the 4 columns.
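To see the grouping in action, the pattern can be run on a single string, looking at each intermediate step. This is only a sketch: the variable names (x, pattern, groups, joined, parts) are made up for illustration, and it assumes an R version where regexec() accepts perl = TRUE (3.3.0 or later).

```r
# Example string from the question (leading space removed, quotes kept).
x <- "'ab[+12.1]abcdefgh.abc_123.1_123.1'"

# The same five-group pattern used in the answer's gsub() call.
pattern <- "(['\\w\\+\\.\\[]*)(\\]*)([a-z]+)(\\.)([\\w\\.']+)"

# regexec() + regmatches() return the full match followed by each capture group.
groups <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
print(groups[-1])  # drop the full match, keep the five groups

# Rejoin groups 1-3 and 5, putting "_" in place of group 4 (the dot).
joined <- gsub(pattern, "\\1\\2\\3_\\5", x, perl = TRUE)
print(joined)

# Every field is now separated by "_", so one split yields the four pieces.
parts <- strsplit(joined, "_", fixed = TRUE)[[1]]
print(parts)
```

Printing groups[-1] is a quick way to check that the pattern captures what you expect before applying it to the whole column.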

I hope this helps.