mks mks - 2 months ago 11
R Question

R: How to use setdiff on two string vectors by only comparing the first 3 tab delimited items in each string, without using qdap?

I previously asked this question, and the answer I received worked: "R: How to use setdiff on two string vectors by only comparing the first 3 tab delimited items in each string?"
However, the qdap library is not stable and needs some manual configuration on an individual's machine before it will work (see "cannot load R package qdap"). So I am re-asking the question: is there a way to do this without using qdap? I repeat the question below:

I am trying to figure out a way in R to take the difference of two string vectors, but based only on the first 3 tab-delimited columns in each string. For example, these are list1 and list2:

list1:

"1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
"1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
"1\t1180200\t1187599\t1\t1177632\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"


list2:

"1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n"
"1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"


I want to do setdiff(list2, list1), so that I get everything in list2 that does not exist in list1. However, I want the comparison to be based on just the first 3 tab-delimited fields. So in list1 I would only consider:

"1\t1113200\t1118399"


from the first entry. However, I still want the full string returned; the first 3 columns are used only for the comparison. I am having trouble figuring out how to do this, and any help would be appreciated. I've already looked at several SO posts, but none of them seemed to help.

Answer

Looks like you just need to extract up to the third tab character (to get the first three columns) from list1 and compare that to the same in list2?

There are quite a few ways to do this in base R. Here's one that uses a regular expression to extract everything up to and including the third tab:

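# first, get the first 3 columns of `list1` (everything up to and including the third tab)
m = regexec("^(?:[^\t]+\t){3}", list1)
# regmatches() returns a list holding the matched first 3 columns of each element of `list1`;
# unlist() flattens it to a character vector (elements that do not match are dropped)
first3.list1 = unlist(regmatches(list1, m))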

Now that we have the first three columns of list1, we can match them against list2. Extract the first three columns of list2 in the same way, then use %in% as in the answer to your previous question. (setdiff would return only the non-matching 3-column prefixes, whereas the logical vector from %in% can index the original list2 and so return the entire original strings.)

m = regexec("^(?:[^\t]+\t){3}", list2)
first3.list2 = unlist(regmatches(list2, m))
list2[!(first3.list2 %in% first3.list1)]

(For the example you provided, every line in list2 has its first 3 columns present in list1, so the result is empty.)
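
To make the difference between setdiff and %in% concrete, here is a small sketch using the data from the question plus one hypothetical extra line (list2b, which is not in the question). The helper first3() is an illustrative name, and it uses regexpr rather than regexec as a slight variation on the code above; it assumes every string has at least three non-empty tab-delimited fields.

```r
# Data from the question
list1 <- c(
  "1\t1113200\t1118399\t1\t1101465\t1120176\tENSRNOG00000040300\tRaet1l\t0\n",
  "1\t1180200\t1187599\t1\t1177682\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n",
  "1\t1180200\t1187599\t1\t1177632\t1221416\tENSRNOG00000061316\tAABR07000121.1\t0\n"
)
list2 <- list1[1:2]

# Hypothetical extra line (not part of the question) with a new 3-column prefix
list2b <- c(list2, "2\t500\t900\t2\t400\t1000\tGENE_X\tname\t0\n")

# Illustrative helper: the text up to and including the third tab.
# regmatches() silently drops strings that do not match, so this assumes
# every input has at least three non-empty tab-delimited fields.
first3 <- function(x) regmatches(x, regexpr("^([^\t]+\t){3}", x))

key1  <- first3(list1)
key2b <- first3(list2b)

setdiff(key2b, key1)               # only the 3-column prefix of the new line
list2b[!(key2b %in% key1)]         # the full original string of the new line
list2[!(first3(list2) %in% key1)]  # empty for the question's own data
```

Because the prefixes are computed in the same order as the input vector, the logical vector from %in% lines up with list2b element by element, which is what makes the indexing safe.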


Other approaches include using strsplit or read.delim to split each string into columns, then using paste to join the first 3 columns back together, and then proceeding as above.
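
A minimal sketch of the strsplit and paste route, assuming the fields are separated by single tab characters. The helper name first3_split() and the small vectors a and b are made up for illustration and are not from the question.

```r
# Illustrative helper: split each string on tabs, keep at most the first
# three fields, and paste them back together with tabs.
first3_split <- function(x) {
  parts <- strsplit(x, "\t", fixed = TRUE)
  vapply(parts, function(p) paste(head(p, 3), collapse = "\t"), character(1))
}

# Made-up example data
a <- c("1\t100\t200\tfoo\n", "1\t300\t400\tbar\n")
b <- c("1\t100\t200\tdifferent\n", "1\t500\t600\tbaz\n")

# Full strings of b whose first 3 columns do not occur in a
b[!(first3_split(b) %in% first3_split(a))]
```

Unlike the regmatches approach, vapply always returns one key per input string, so the keys stay aligned with the original vector even when a string has fewer than three fields.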