user1981275 - 3 months ago 21
R Question

Split string on special character

I have a string (fasta format)

a = ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"


and would like to separate it at the character >, filter out the newlines, and put the three substrings separated by > into a vector or list with three elements:

>atttaggaccttaattgtcggta

>ccattnnnncccatt

>ttaggccta


I tried strsplit:

unlist(strsplit(a, "(?<=>)", perl=T))


but this puts the delimiter > at the end of each string.

I found related questions here and here, but I can't really get it to work without making a complicated construct.

Is there a simple solution to do this in one go?

Answer

Your regex contains only a lookbehind, so it matches every empty location that immediately follows a > (see your regex demo). The engine processes the string from left to right, checks whether there is a > immediately to the left of the current location, and returns a valid empty-string match if a > is found.
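To make that concrete, here is a small sketch run on the string from the question. It assumes the usual strsplit behavior, where the text before each match becomes one piece and matching then restarts on the remainder:

```r
a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"

# Lookbehind only: every split point sits right AFTER a ">",
# so each ">" stays attached to the end of the preceding piece.
pieces <- unlist(strsplit(a, "(?<=>)", perl = TRUE))
pieces
# [1] ">"                             "atttaggacctta\nattgtcggta\n>"
# [3] "ccattnnnn\ncccatt\n>"          "ttaggccta"
```

The first piece is the lone leading >, and every later > ends up at the end of a piece, which is the behavior described in the question.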

You may use the (?<=[^>])(?=>) regex:

> res <- unlist(strsplit(a, "(?<=[^>])(?=>)", perl=TRUE))
> res
[1] ">atttaggacctta\nattgtcggta\n" ">ccattnnnn\ncccatt\n"
[3] ">ttaggccta"
> gsub("\n", "", res, fixed=TRUE)
[1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt"
[3] ">ttaggccta"

The pattern matches a location that is preceded by a character other than > and followed by a > character.
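If you want this in one go, the split and the newline removal can be wrapped in a small helper. This is only a sketch; the name split_fasta is made up for illustration and is not part of base R or any package:

```r
# Hypothetical helper: split a FASTA-style string before each ">" and
# drop the newlines inside every record.
split_fasta <- function(x) {
  # Split at positions preceded by a non-">" character and followed by ">".
  records <- unlist(strsplit(x, "(?<=[^>])(?=>)", perl = TRUE))
  # Remove literal newline characters from each record.
  gsub("\n", "", records, fixed = TRUE)
}

a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"
fasta_records <- split_fasta(a)
fasta_records
# [1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt"
# [3] ">ttaggccta"
```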

Note that using only a lookbehind pattern with strsplit often leads to unexpected behavior. See "Why does strsplit use positive lookahead and lookbehind assertion matches differently?"
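If you would rather avoid lookarounds in strsplit altogether, one alternative is to extract the records instead of splitting on them. Note that this is a different technique from the split shown earlier, given here only as a sketch; it relies on the base R functions gregexpr and regmatches:

```r
a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"

# Match each ">" together with everything up to (but not including) the next ">".
matches <- regmatches(a, gregexpr(">[^>]*", a, perl = TRUE))[[1]]

# Remove the newlines inside each extracted record.
extracted <- gsub("\n", "", matches, fixed = TRUE)
extracted
# [1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt"
# [3] ">ttaggccta"
```

Because the pattern consumes real characters rather than matching an empty position, it does not depend on how strsplit handles zero-length matches.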
