user3067923 - 16 days ago 6
R Question

Splitting a vector by regular expression into a data frame

I have a character vector that looks like this:

head(val)
[1] "PD2323 [403-407]" "P05230 [455-459]"


I would like to split it into a dataframe with 3 columns and many rows. The output should look something like this:

head(output)
[,1] [,2] [,3]
[1,] "P20700" 403 407
[2,] "P05787" 455 459
[3,] "O14641" 168 178


However, when I try to set this up, I end up with a matrix that has more than 3 columns. Splitting on whitespace gives a list:

head(strsplit(val, "\\s+"))

[[1]]
[1] "PD2323" "[403-407]"

[[2]]
[1] "P05230" "[455-459]"

[[3]]
[1] "AS14641" "[168-178]"

[[4]]
[1] "SS7Z3Z4" "[424-428]"

[[5]]
[1] "QQN4C6-2" "[671-679]"

[[6]]
[1] "DD9Y3B2" "[7-13]"


At first this looks promising:

do.call(rbind, head(strsplit(val, "\\s+")))
[,1] [,2]
[1,] "PD2323" "[403-407]"
[2,] "P05230" "[455-459]"
[3,] "AS14641" "[168-178]"
[4,] "SS7Z3Z4" "[424-428]"
[5,] "QQN4C6-2" "[671-679]"
[6,] "DD9Y3B2" "[7-13]"


But if I now remove the head call, I end up with a matrix of 90 columns for some reason:

dim(do.call(rbind, strsplit(val, "\\s+")))

[1] 23369 90
Warning message:
In .Method(..., deparse.level = deparse.level) :
number of columns of result is not a multiple of vector length (arg 314)

Answer

We can use gsub to replace the square brackets and the - with spaces, and then read the result into a data.frame with read.table:

d1 <- read.table(text=gsub("[][]|-", " ", val), header=FALSE, stringsAsFactors=FALSE)
d1
#      V1  V2  V3
#1 PD2323 403 407
#2 P05230 455 459
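
One caveat with the gsub approach: it also replaces a hyphen inside the identifier itself (the question shows one, "QQN4C6-2"), which would give that row four fields instead of three. A possible alternative, sketched here rather than offered as the definitive fix, is to capture the three pieces with a single pattern using utils::strcapture (available in R >= 3.4.0). The column names id, start and end and the three-element sample vector are assumptions made for illustration.

```r
# Sample data: the two values from the answer plus the hyphenated ID
# that appears in the question's strsplit output
val <- c("PD2323 [403-407]", "P05230 [455-459]", "QQN4C6-2 [671-679]")

# One capture group per output column: identifier, range start, range end
pattern <- "^(\\S+)\\s+\\[(\\d+)-(\\d+)\\]\\s*$"

# proto supplies the names and types of the result columns
# (id, start and end are illustrative names, not from the original post)
proto <- data.frame(id = character(), start = integer(), end = integer(),
                    stringsAsFactors = FALSE)

# Elements that do not match the pattern come back as rows of NA
d2 <- utils::strcapture(pattern, val, proto, perl = TRUE)
d2
```

Because the hyphen is only matched between the two numbers inside the brackets, the identifier is kept whole, and start and end are returned as integers rather than strings.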

data

val <- c( "PD2323 [403-407]",   "P05230 [455-459]")
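
As for the 90 columns in the question: the rbind warning ("arg 314") suggests that at least one element of val splits into a different number of pieces than the others, so rbind recycles the shorter rows to the width of the longest one. A minimal sketch for locating such elements is below. The third sample value is hypothetical, made up to reproduce the symptom, since the full vector is not shown.

```r
# Hypothetical sample: the third element has extra whitespace-separated tokens
val <- c("PD2323 [403-407]", "P05230 [455-459]", "P12345 [1-5] [9-12] extra")

parts <- strsplit(val, "\\s+")
n_tokens <- lengths(parts)     # number of pieces each element splits into

table(n_tokens)                # how many elements have each piece count
bad <- which(n_tokens != 2)    # positions that do not split into exactly 2 pieces
val[bad]                       # inspect the offending strings
```

Running the same check on the real vector should show which entries deviate from the "ID [start-end]" shape, and whether they need cleaning or a more permissive pattern.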