BioMan - 1 year ago 70
R Question

# Find strings common to at least two data sets

I have 4 data sets (a,b,c,d) and I want to find strings that are present in at least two of the data sets. My data looks like this:

``````
head(a)
``````

``````
[1] MLH3      PCSK7     PKMYT1    C14orf132 ANP32A    POLQ
1634 Levels: AARS ABAT ABCA8 ABCC9 ABCE1 ABHD3 ABHD5 ABL1 ABLIM1 ACADVL ACAN ACAT2 ACBD3 ACD ACLY ACOT2 .
``````

``````
head(b)
``````

``````
[1] ZCCHC10  DYNLL1   ERBB2IP  C17orf75 BUB1B    PLK1
1311 Levels: AASDHPPT ABAT ABCA6 ABCG1 ABI1 ACAA1 ACACB ACO2 ACOX1 ACSL1 ACSL3 ACSL4 ACTR6 ADAMTS1 ADCYAP1R1 ... ZRANB2
``````

``````
head(c)
``````

``````
[1] UBE2Q1    PCSK9     ZDHHC11   GMDS      PPP2R3B   C20orf117
1247 Levels: ABCC2 ABCC5 ABCF1 ABCG1 ABHD14B ABHD5 ABL1 ABLIM2 ABTB2 ACAD8 ACD ACO1 ACOT9 ACSL3 ACSS2 ACTA2 ... ZYG11B
``````

``````
head(d)
``````

``````
[1] UBE2Q1    PCSK9     ZDHHC11   GMDS      PPP2R3B   C20orf117
1247 Levels: ABCC2 ABCC5 ABCF1 ABCG1 ABHD14B ABHD5 ABL1 ABLIM2 ABTB2 ACAD8 ACD ACO1 ACOT9 ACSL3 ACSS2 ACTA2 ... ZYG11B
``````

I am thinking of using the `intersect()` function in R.

You could combine the unique elements of each of your four vectors into a single vector and return the duplicated elements. Because each input is de-duplicated first, any element that is duplicated in the combined vector must appear in two or more of the original vectors:

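``````
# as.character() guards against factor input: before R 4.1.0, c() on factors
# returns the integer codes instead of the labels.
all.vals <- c(as.character(unique(a)), as.character(unique(b)),
              as.character(unique(cc)), as.character(unique(d)))
unique(all.vals[duplicated(all.vals)])
# [1] "UBE2Q1"    "PCSK9"     "ZDHHC11"   "GMDS"      "PPP2R3B"   "C20orf117"
``````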

Note that I renamed your third vector to `cc` so that it does not mask the built-in function `c`. The sample data used for the output above:

``````
a <- c("MLH3", "PCSK7", "PKMYT1", "C14orf132", "ANP32A", "POLQ")
b <- c("ZCCHC10", "DYNLL1", "ERBB2IP", "C17orf75", "BUB1B", "PLK1")
cc <- c("UBE2Q1", "PCSK9", "ZDHHC11", "GMDS", "PPP2R3B", "C20orf117")
d <- c("UBE2Q1", "PCSK9", "ZDHHC11", "GMDS", "PPP2R3B", "C20orf117")
``````
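
If you later need "at least three" rather than "at least two", or want to stay close to the `intersect()` idea from the question, the same logic can be written as a count per string. The following is only a sketch under some assumptions: the helper name `common_in_at_least` and its `k` argument are made up for illustration, and the sample vectors are wrapped in `factor()` to mimic the factor data shown in the question.

```r
# Sample data from the answer, stored as factors like the question's data.
a  <- factor(c("MLH3", "PCSK7", "PKMYT1", "C14orf132", "ANP32A", "POLQ"))
b  <- factor(c("ZCCHC10", "DYNLL1", "ERBB2IP", "C17orf75", "BUB1B", "PLK1"))
cc <- factor(c("UBE2Q1", "PCSK9", "ZDHHC11", "GMDS", "PPP2R3B", "C20orf117"))
d  <- factor(c("UBE2Q1", "PCSK9", "ZDHHC11", "GMDS", "PPP2R3B", "C20orf117"))

# Hypothetical helper (not from the original answer):
# returns the strings present in at least `k` of the supplied sets.
common_in_at_least <- function(sets, k = 2) {
  # De-duplicate within each set so a string counts at most once per set.
  per_set <- lapply(sets, function(x) unique(as.character(x)))
  # Count in how many sets each string occurs.
  counts <- table(unlist(per_set, use.names = FALSE))
  names(counts)[as.vector(counts) >= k]
}

sets <- list(a = a, b = b, cc = cc, d = d)
in_two_or_more <- common_in_at_least(sets, k = 2)

# The same result via intersect(): take every pair of sets and pool the overlaps.
pairs <- combn(names(sets), 2, simplify = FALSE)
via_intersect <- unique(unlist(lapply(pairs, function(p) {
  intersect(as.character(sets[[p[1]]]), as.character(sets[[p[2]]]))
})))

print(in_two_or_more)
print(via_intersect)
```

Both variants should give the same set of strings as the `duplicated()` approach, though possibly in a different order. The pairwise `intersect()` version only answers the "at least two" case, while the counting version lets you change `k`.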