pill45 pill45 - 3 months ago 7
R Question

How to insert a column between two column data frame, with match condition, in R?

I have two data frames,

df1

Identifier GSE1028888 GSE1034555
100002 0.1 0.2
100003 0.3 0.4
...... ..... .....
100007 0.9 1.1


df2

V3 V2
100002 XLX12
100003 ABorF
...... .....
110000 GEF22


Now I want insert the V2 information into df1, such as

df3

Identifier New_V2 GSE1028888 GSE1034555
100002 XLX12 0.1 0.2
100003 ABorF 0.3 0.4
100004 NA 0.6 0.7
...... ..... .....
100007 ccL34 0.9 1.1


The V3 of df2 and Identifier of df1 have different length.
I try dplyr left_join, but the column is attached at the end.

This is the code to create similar data frame

df1 <- data.frame("Identifier" = sample(100001:100010, 6, replace = F),
y = sample(rnorm(10), 6, replace = T),
z = sample(rnorm(10), 6, replace = T))
df2 <- data.frame(V1 = c(100001:100010),
V2 = sample(state.name, 10, replace = T))


This time when I try dplyr left_join,

left_join(df1, df2, by =c("Identifier"="V3"))


an error message was shown

Error: cannot join on columns 'V3' x 'Identifier': index out of bounds


Anyone has any idea?

Answer

Using the data you provided:

df1 <- data.frame("Identifier" = sample(100001:100010, 6, replace = F), 
               y = sample(rnorm(10), 6, replace = T), 
               z = sample(rnorm(10), 6, replace = T))
df2 <- data.frame(V1 = c(100001:100010), 
              V2 = sample(state.name, 10, replace = T))

merge(df2, df1, by.x="V1",by.y="Identifier", all.x=TRUE)
       V1            V2          y          z
1  100001     Wisconsin -1.9468637  0.4509951
2  100002      Nebraska  2.5155819  0.4509951
3  100003          Ohio         NA         NA
4  100004 Massachusetts         NA         NA
5  100005       Montana -1.9468637 -2.1825878
6  100006      Illinois -0.1591367  0.3445637
7  100007    New Mexico -0.5696300  0.4509951
8  100008    New Jersey         NA         NA
9  100009     Tennessee         NA         NA
10 100010    Washington -1.9468637 -0.5402241

And you can toggle the order which ever way you like but the above output is what you requested -no need to reorder but if wanted to:

merge(df2, df1, by.x="V1",by.y="Identifier", all.x=TRUE)[c(2,3,4,1)]
              V2          y          z     V1
1      Wisconsin -1.9468637  0.4509951 100001
2       Nebraska  2.5155819  0.4509951 100002
3           Ohio         NA         NA 100003
4  Massachusetts         NA         NA 100004
5        Montana -1.9468637 -2.1825878 100005
6       Illinois -0.1591367  0.3445637 100006
7     New Mexico -0.5696300  0.4509951 100007
8     New Jersey         NA         NA 100008
9      Tennessee         NA         NA 100009
10    Washington -1.9468637 -0.5402241 100010

As per the OP's posted comments below, here is an example where an identifier exists in df1 that does not exist in df2

df1 <- data.frame("Identifier" = sample(100001:100012, 6, replace = F), 
               y = sample(rnorm(10), 6, replace = T), 
               z = sample(rnorm(10), 6, replace = T))

df1
  Identifier           y            z
1     100011 -1.60532712  1.365836073
2     100007 -1.28821500  0.005925986
3     100004 -0.03444609  0.780708952
4     100006  0.32190045  0.780708952
5     100009 -1.60532712 -1.471916384
6     100005 -0.76985033  0.191956916

df2 <- data.frame(V1 = c(100001:100010), 
              V2 = sample(state.name, 10, replace = T))
df2
       V1            V2
1  100001  Pennsylvania
2  100002 West Virginia
3  100003          Utah
4  100004        Alaska
5  100005          Ohio
6  100006   Mississippi
7  100007 New Hampshire
8  100008    New Jersey
9  100009          Ohio
10 100010       Georgia

merge(df2, df1, by.x="V1",by.y="Identifier", all.x=TRUE, all.y=TRUE)
       V1            V2           y            z
1  100001  Pennsylvania          NA           NA
2  100002 West Virginia          NA           NA
3  100003          Utah          NA           NA
4  100004        Alaska -0.03444609  0.780708952
5  100005          Ohio -0.76985033  0.191956916
6  100006   Mississippi  0.32190045  0.780708952
7  100007 New Hampshire -1.28821500  0.005925986
8  100008    New Jersey          NA           NA
9  100009          Ohio -1.60532712 -1.471916384
10 100010       Georgia          NA           NA
11 100011          <NA> -1.60532712  1.365836073
Comments