Komal Rathi Komal Rathi - 3 months ago 13
R Question

Intersect dataframe on multiple columns

I have 2 dataframes:

dat: 1900 obs of 9 variables

V1 V2 V3 V4 V5 V6 V7 V8 V9
1 V_P50P50_Q3 chr12 106642383 106642395 + 18.1425 4.03e-08 0.0515 GGGGGACTCCCCC
2 V_P50RELAP65_Q5_01 chr8 142276666 142276677 - 16.6429 2.51e-07 0.2780 GGGATTTCCCAC
3 V_RELA_Q6 chr22 51020067 51020078 - 15.9395 2.71e-07 0.3350 GGGAATTTCCCC
4 V_NFKB_Q6_01 chr14 98601454 98601469 + 17.0684 3.08e-07 0.236 GGAGTGGAAATTCC
5 V_CREL_Q6 chr22 51020068 51020079 - 16.1165 3.19e-07 0.4050 AGGGAATTTCCC


dat.markov: 1486 obs of 9 variables

V1 V2 V3 V4 V5 V6 V7 V8 V9
1 V_NFKB_Q6_01 chr14 98601454 98601469 + 17.2212 1.33e-07 0.146 GGAGTGGAAATTCCCT
2 V_P50P50_Q3 chr12 106642383 106642395 + 16.9358 1.57e-07 0.201 GGGGGACTCCCCC
3 V_CREL_Q6 chr22 51020068 51020079 - 16.0549 2.29e-07 0.292 AGGGAATTTCCC
4 V_NFKB_Q6_01 chr22 51020064 51020079 + 16.9906 2.32e-07 0.146 TTGGGGGAAATTCCCT
5 V_RELA_Q6 chr22 51020067 51020078 - 15.7496 3.42e-07 0.433 GGGAATTTCCCC


I need to merge the two data frames such that I get all the rows with matching columns V1, V2, V3 and V4 between the two data.frames.

I tried:

y<-merge(dat,dat.markov,by=c("V1","V2","V3","V4"))


which gives me a merged dataframe but with 1513 obs. But technically, the number of observations should be equal to or less than the smaller dataframe i.e. 1486 obs.

My merged data.frame looks alright in terms of number of columns returned:

V1 V2 V3 V4 V5.x V6.x V7.x V8.x V9.x V5.y
1 V_CREL_01 chr10 112778464 112778473 + 12.9434 1.94e-05 0.694 TGGGTTTTCC +
V6.y V7.y V8.y V9.y
1 12.8838 2.35e-05 0.788 TGGGTTTTCC


I know you can intersect the data.frames using one column but is there a way in which you can intersect two data.frames on multiple columns?

Answer

If I understand, you want columns 1,2,3,4 to be perfectly matched in both frames in order to keep them? Besides merge, I would use the interaction and match functions

dat$combine = as.character(interaction(dat$V1, dat$V2, dat$V3, dat$V4))
dat.markov$combine = as.character(interaction(dat.markov$V1, dat.markov$V2, dat.markov$V3, dat.markov$V4))

dat.overlap = dat[complete.cases(match(dat.markov$combine, dat$combine)),]
Comments