user976991 - 29 days ago
R Question

Removing rows with swapped column values from a data frame by comparing two columns

This is one of my particular nightmares when I'm trying to merge different gene expression results according to paired-gene conditions. Here is my merged data frame:

knowngene1 Logfold1 Gene1 knowngene2 Logfold2 Gene2
uc001ezv.3 5.1167021111 NA uc001ezu.1 5.6262305191 FLG
uc001ihe.4 4.1338871783 LOC100216001 uc001ihg.3 3.9475325801 NA
uc001iki.4 9.9902455211 CELF2 uc001ikn.2 9.3321964303 NA
uc001ikk.2 10.3059806111 CELF2 uc001ikn.2 9.3321964303 NA
uc001ikl.4 9.9890468379 CELF2 uc001ikn.2 9.3321964303 NA
uc001ikn.2 9.8293484977 NA uc001iki.4 9.4401488053 CELF2
uc001ikn.2 9.8293484977 NA uc001ikk.2 9.2887954663 CELF2
uc001ikn.2 9.8293484977 NA uc001ikl.4 9.4401488053 CELF2
uc001ikn.2 9.8293484977 NA uc010qbi.2 8.6399349792 CELF2
uc001ikn.2 9.8293484977 NA uc010qbj.1 9.2887954663 CELF2
uc001ezu.1 5.6262305191 FLG uc001ezv.3 5.1167021111 NA
uc001ihg.3 3.9475325801 NA uc001ihe.4 4.1338871783 LOC100216001
uc001iki.4 9.4401488053 CELF2 uc001ikn.2 9.8293484977 NA
uc001ikk.2 9.2887954663 CELF2 uc001ikn.2 9.8293484977 NA
uc001ikl.4 9.4401488053 CELF2 uc001ikn.2 9.8293484977 NA
uc001ikn.2 9.3321964303 NA uc001iki.4 9.9902455211 CELF2
uc001ikn.2 9.3321964303 NA uc001ikk.2 10.3059806111 CELF2
uc001ikn.2 9.3321964303 NA uc001ikl.4 9.9890468379 CELF2
uc001ikn.2 9.3321964303 NA uc010qbi.2 10.3865530025 CELF2
uc001ikn.2 9.3321964303 NA uc010qbj.1 10.3072927485 CELF2
uc001iot.1 6.9068905956 NA uc001iou.2 8.4040043896 VIM
uc001iou.2 10.4420548632 VIM uc001iot.1 5.8235197903 NA
uc001ipd.3 4.4693510978 ST8SIA6 uc001ipf.1 5.1931857169 NA
uc001kgd.3 3.5469561781 NA uc009xts.3 4.0607448636 IFIT2
uc001kgf.3 3.3975573789 IFIT3 uc001kgd.3 3.2512633588 NA


The point is that I don't want to remove duplicated rows (there aren't any, of course). I want to remove the rows whose knowngene identifiers are swapped between knowngene1 and knowngene2. Let me show an example. This first row is the one I want to keep:

uc001ikn.2 9.8293484977 NA uc001iki.4 9.4401488053 CELF2


For me, these next rows are the same as it. In fact, the first one is the mirror image of the row I want to keep, apart from its expression values, which are in more or less the same range:

uc001iki.4 9.4401488053 CELF2 uc001ikn.2 9.8293484977 NA
uc001ikn.2 9.3321964303 NA uc001ikl.4 9.9890468379 CELF2


So the idea is to keep ONLY the first occurrence I see and remove the later ones. Do you have any ideas?

Answer

So you want to remove all rows where uc001ikn.2 appears? If so, I think this will work:

Rgames> foo
     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9
[4,]    2    3
[5,]    4    1
[6,]    3   10
[7,]    5   11
[8,]    6   12
Rgames> foo[!duplicated(foo[,1]) & !(foo[,2] %in% foo[,1]), ]
     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9
[4,]    5   11
[5,]    6   12

In your case, you'd operate on the df$knowngene1 and df$knowngene2 columns.
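
If the goal is rather to keep the first row seen for each pair of IDs, whichever column each ID is in, one possible approach is to build an order-independent key per row and drop the repeats. This is only a minimal sketch: the small `df` below is a hypothetical subset of the question's rows with the Logfold and Gene columns left out, `pair_key` and `dedup` are names made up for illustration, and it assumes two rows count as "the same" when they hold the same two knowngene IDs.

```r
# Hypothetical subset of the question's data (ID columns only)
df <- data.frame(
  knowngene1 = c("uc001ikn.2", "uc001iki.4", "uc001ikn.2", "uc001iot.1", "uc001iou.2"),
  knowngene2 = c("uc001iki.4", "uc001ikn.2", "uc001ikl.4", "uc001iou.2", "uc001iot.1"),
  stringsAsFactors = FALSE
)

# as.character() guards against factor columns, which pmin()/pmax() reject
a <- as.character(df$knowngene1)
b <- as.character(df$knowngene2)

# Order-independent key: the smaller ID always goes first,
# so A/B and B/A give the same string
pair_key <- paste(pmin(a, b), pmax(a, b), sep = "|")

# Keep only the first row seen for each unordered pair
dedup <- df[!duplicated(pair_key), ]
```

On this subset, rows 1, 3 and 4 are kept and the mirrored rows 2 and 5 are dropped. If rows should instead be treated as the same at the gene level (for example, different transcripts of CELF2), the same idea could probably be applied to Gene1 and Gene2, but the NA entries would need handling first.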