sc73 sc73 - 3 months ago 23
R Question

Splitting irregular columns in r

I have a column in my data frame that I need to split into multiple columns using the delimiter "_". However, I need to retain only the last two columns from the output (which will always contain the data that I need) in each row. In many records, the number of delimiters is different, therefore resulting on a different number of columns when split. How do I get only the last two columns of each observation. The following are examples of some of the records

unique(data$tagid.1)
[1] tag id 00000_0_0900_226000013189
[3] 00000_0_0986_114100005288 00000_0_0900_226000132078
[5] 00000_0_09LA_00000_0_0900_226000 00000_0_0900_226000131998
[7] 0000_2004000000000847 00000_0_0900_22600001\a\0048\022LI
[9] 00000_0_0900_226000013189I 00000_0_0986_114100006473


I'm trying to get an output something like:

tagid$C1 tagid$C2
0986 114100005288
0900 226000013189
0900 226000
etc.... etc....


My solution has some problems, i.e. it outputs two rows with 57k columns and its kind of slow, anyone have a better solution than:

> data.tag <- as.data.frame(data$tagid.1)
> tag1 <- cSplit(data.tag,"data$tagid.1",sep="_")
>
> head(tag1)
data$tagid.1_1 data$tagid.1_2 data$tagid.1_3 data$tagid.1_4 data$tagid.1_5 data$tagid.1_6 data$tagid.1_7
1: tag id NA NA NA NA NA NA
2: 00000 0 0900 226000013189 NA NA NA
3: 00000 0 0900 226000013189 NA NA NA
4: 00000 0 0900 226000013189 NA NA NA
5: 00000 0 0900 226000013189 NA NA NA
6: 00000 0 0900 226000013189 NA NA NA
>
> lastValue <- function(x) tail(x[!is.na(x)], 2)
> tag2 <- as.data.frame(apply(tag1, 1, lastValue))
> dim(tag2)
[1] 2 56997

Answer

It is possible to achieve this by using a regular expression:

pat <- "^.*_(.*)_(.*)$"
data.tag <- data.frame(tagid.1 = c("tag id",
                         "00000_0_0900_226000013189",
                         "00000_0_0986_114100005288",
                         "00000_0_0900_226000132078",
                          "00000_0_0900_22600001\a\0048\022LI"))
data.frame(C1 = sub(pat, "\\1", data.tag[,1]),
           C2 = sub(pat, "\\2", data.tag[,1]))


      C1                    C2
1 tag id                tag id
2   0900          226000013189
3   0986          114100005288
4   0900          226000132078
5   0900 22600001\a\0048\022LI