Learner Algorithm - 23 days ago
R Question

How can I compare two columns of strings?

I have two data frames. The first is df1:

df1<- structure(list(V1 = structure(c(1L, 2L, 3L, 7L, 5L, 6L, 4L, 9L,
8L), .Label = c("A0A061ACH4;Q95Q10;Q9U1W6", "A0A061ACL3;Q965I6;O76618",
"A0A061ACR1;Q2XN02;F5GUA3;Q22498", "A0A061AJJ3;A0A061AEA8", "A0A061AL01",
"C1P641", "H2FLH3;H2FLH2;A0A061ACT3;A0A061AE24;Q23551-2;Q23551;Q23551-4;Q23551-3;Q23551-5",
"Q22501;A0A061AE05", "Q86CZ7"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-9L))


and the other is df2:

df2 <- structure(list(V1 = structure(c(1L, 2L, 3L, 6L, 5L, 4L, 8L, 9L,
7L), .Label = c("A0A061ACH4;Q95Q10;Q9U1W6", "A0A061ACL3;Q965I6;O76618",
"A0A061ACR1;Q2XN02;F5GUA3;Q22498", "A0A061AJJ3;A0A061AEA8", "A0A061AL01",
"H2FLH3;H2FLH2;A0A061ACT3;A0A061AE24;Q23551-2;Q23551;Q23551-4;Q23551-3;Q23551-5",
"Q22501;A0A061AE05", "Q27GQ4", "Q86CZ7"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-9L))


I want to compare these two data frames line by line, to see which lines of df1 also appear in df2 and vice versa.

Then I want to build an output containing all unique lines from both df1 and df2 (that is, every line from the two data frames in one new data frame).

Finally, next to each line that is in df2 but not in df1, I want a zero in the df1 column, and likewise a zero in the df2 column for each line that is only in df1.

The expected output could look like this:

output<- structure(list(V1 = structure(c(1L, 2L, 3L, 4L, 8L, 6L, 7L, 5L,
10L, 11L, 9L), .Label = c("", "A0A061ACH4;Q95Q10;Q9U1W6", "A0A061ACL3;Q965I6;O76618",
"A0A061ACR1;Q2XN02;F5GUA3;Q22498", "A0A061AJJ3;A0A061AEA8", "A0A061AL01",
"C1P641", "H2FLH3;H2FLH2;A0A061ACT3;A0A061AE24;Q23551-2;Q23551;Q23551-4;Q23551-3;Q23551-5",
"Q22501;A0A061AE05", "Q27GQ4", "Q86CZ7"), class = "factor"),
V2 = structure(c(3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L,
1L), .Label = c("", "0", "df1"), class = "factor"), V3 = structure(c(3L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("", "0",
"df2"), class = "factor")), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-11L))


Here, Q27GQ4 does not exist in df1 but exists in df2, so the df1 column of the output shows a zero. Conversely, C1P641 exists in df1 but not in df2, so the df2 column of the output shows a zero.

I would appreciate any help, since I am very new to R and could not figure out how to do this.

Answer

Try this out:

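# Full outer join on V1: keeps every unique line from both data frames
op <- merge(df1, df2,
      all.x = TRUE,
      all.y = TRUE)

# 1 if the line occurs in df1, 0 otherwise (1 * TRUE/FALSE gives 1/0)
op$df1 <- 1*(op$V1 %in% df1$V1)

# 1 if the line occurs in df2, 0 otherwise
op$df2 <- 1*(op$V1 %in% df2$V1)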

> op
                                                                               V1 df1 df2
1                                                        A0A061ACH4;Q95Q10;Q9U1W6   1   1
2                                                        A0A061ACL3;Q965I6;O76618   1   1
3                                                 A0A061ACR1;Q2XN02;F5GUA3;Q22498   1   1
4                                                           A0A061AJJ3;A0A061AEA8   1   1
5                                                                      A0A061AL01   1   1
6                                                                          C1P641   1   0
7  H2FLH3;H2FLH2;A0A061ACT3;A0A061AE24;Q23551-2;Q23551;Q23551-4;Q23551-3;Q23551-5   1   1
8                                                               Q22501;A0A061AE05   1   1
9                                                                          Q86CZ7   1   1
10                                                                         Q27GQ4   0   1

Or, using dplyr:

library(dplyr)

op <- merge(df1,df2, 
             all.x = TRUE,
             all.y = TRUE) %>% 
        mutate(df1=1*(V1 %in% df1$V1),
               df2=1*(V1 %in% df2$V1))
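
If you would rather not go through merge at all, the same table can be sketched in base R with union(). This is only a minimal illustration: the two small data frames below are stand-ins built from a few of the IDs in the question, and they store V1 as character rather than factor (an assumption; for factor columns you may want to wrap them in as.character() first).

```r
# Small stand-ins for df1 and df2, using a few IDs from the question.
# stringsAsFactors = FALSE keeps V1 as plain character strings.
df1 <- data.frame(V1 = c("A0A061AL01", "C1P641", "Q86CZ7"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = c("A0A061AL01", "Q27GQ4", "Q86CZ7"),
                  stringsAsFactors = FALSE)

# All unique lines from both data frames, in order of first appearance
all_lines <- union(df1$V1, df2$V1)

# as.integer() turns the TRUE/FALSE membership test into 1/0 flags
op <- data.frame(V1  = all_lines,
                 df1 = as.integer(all_lines %in% df1$V1),
                 df2 = as.integer(all_lines %in% df2$V1),
                 stringsAsFactors = FALSE)
print(op)
```

Unlike merge, this keeps the lines in the order they first appear instead of sorting them, which may or may not be what you want.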

And here are the answers to your extra questions:

- How many lines of df1 also appear in df2?

sum(df1$V1 %in% df2$V1) 

- How many lines of df1 do not exist in df2?

sum(!(df1$V1 %in% df2$V1))

- How many lines of df2 do not exist in df1?

sum(!(df2$V1 %in% df1$V1))
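
The counts above tell you how many lines differ; if you also want to see which lines they are, base R's set functions are one option. A minimal sketch, again using small character stand-ins for df1 and df2 (an assumption, the real columns in the question are factors):

```r
# Small stand-ins for df1 and df2, using a few IDs from the question
df1 <- data.frame(V1 = c("A0A061AL01", "C1P641", "Q86CZ7"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = c("A0A061AL01", "Q27GQ4", "Q86CZ7"),
                  stringsAsFactors = FALSE)

common   <- intersect(df1$V1, df2$V1)  # lines present in both
only_df1 <- setdiff(df1$V1, df2$V1)    # lines in df1 but not in df2
only_df2 <- setdiff(df2$V1, df1$V1)    # lines in df2 but not in df1

print(common)
print(only_df1)
print(only_df2)
```

Note that intersect() and setdiff() drop duplicates, so length(common) matches sum(df1$V1 %in% df2$V1) only when df1 has no repeated lines.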