mks mks - 2 months ago 14
R Question

R: How to use setdiff on two string vectors by only comparing the first 3 tab delimited items in each string?

I am trying to figure out a way in R to take the difference of two string vectors, but based only on the first 3 tab-delimited columns of each string. For example, here are list1 and list2:

list1:

"1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
"1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
"1\t1180200\t1187599\t1\t1177632\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"


list2:

"1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
"1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"


I want to do setdiff(list2, list1), so that I get everything in list2 that does not exist in list1; however, I want to do it based on just the first 3 tab-delimited fields. So in list1 I would only consider:

"1\t1113200\t1118399"


from the first entry. However, I still want the full string returned; I only want to compare using the first 3 columns. I am having trouble figuring out how to do this, and any help would be appreciated. I've already looked at several SO posts, but none of them seemed to help.

Answer

For extracting the first three columns (not sure why you need this as one long string rather than a data frame...), I would use beg2char() from the qdap library. (Although, if the first three columns always span the same number of characters, base substr() will work fine.)

library(qdap)
beg2char(list1, '\t', 3) # Will extract from the beginning up to the third tab delimiter

Then, rather than setdiff, I would simply use %in% to check whether the substring of each element in list2 matches the substring of any element in list1.

beg2char(list2, '\t', 3) %in% beg2char(list1, '\t', 3) # will give you TRUE/FALSE
list2[!(beg2char(list2, '\t', 3) %in% beg2char(list1, '\t', 3))]

This will give the full elements of list2 whose substrings are nonexistent in list1.
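If you would rather not depend on qdap, the same idea can be sketched in base R. This is only an illustration and not part of the original answer: the helper name first3 is made up, and the third element of list2 is an invented row, added because the two vectors in the question have identical first-three-column keys and the result would otherwise be empty.

```r
# Data from the question (character vectors, despite the "list" names).
list1 <- c(
  "1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n",
  "1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n",
  "1\t1180200\t1187599\t1\t1177632\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
)
list2 <- c(
  "1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n",
  "1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n",
  # Hypothetical extra row, invented for illustration only.
  "2\t5000\t5999\t2\t4000\t7000\tGENE_X\tExample\t0\n"
)

# Hypothetical helper: build a comparison key from the first three
# tab-delimited fields of each string.
first3 <- function(x) {
  fields <- strsplit(x, "\t", fixed = TRUE)
  vapply(fields,
         function(f) paste(head(f, 3), collapse = "\t"),
         character(1))
}

# Keep the full strings of list2 whose key does not occur among list1's keys.
only_in_list2 <- list2[!(first3(list2) %in% first3(list1))]
only_in_list2
```

Unlike substr(), splitting on the tab does not require the first three columns to have a fixed width, so it keeps working if a column value gains a digit.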
