runjumpfly runjumpfly - 1 month ago 11
R Question

Merging two data frames with columns with certain patterns in strings

(I have been stuck with this problem for past two days, So if it has an answer on SO please bear with me.)

I have two data frames A and B. I want to merge them on Name column. Suppose, A has two columns Name and Numbers. The Name column of A df has values ".tony.x.rds", ".tom.x.rds" and so on.

Name Numbers
.tony.x.rds 15.6
.tom.x.rds 14.5


The B df has two columns Name and ChaR. The Name column of B has values "tony.x","tom.x" and so on.

Name ChaR
tony.x ENG
tom.x US


The main element in column Name of both dfs is "tony', "tom" and so on.


So, ".tony.x.rds" is equal to "tony.x" and ".tom.x.rds" is equal to "tom.x".


I have tried gsub with various option leaving me with 'tony", "tom", and so on in column Name of both A and B data frames. But when I use

StoRe<-merge(A,B, all=T)


I ge all the rows of A and B rather than single rows. That is, there are two rows for each "a", "b" and so on for with their respective values in Numbers and ChaR column. For example:

Name Numbers ChaR
tony 15.6 NA
tony NULL ENG
tom 14.5 NA
tom NULL US


It has been giving me splitting headache. I request you to help.

Answer

One possible solution. I am not completely sure what you want to do with the 'x' in the strings, I have kept them in the linkage key, but by changing the \\1\\2 to \\1 you keep only the first letter.

a <- data.frame(
  Name = paste0(".", c("tony", "tom", "foo", "bar", "foobar"), ".x.rds"),
  Numbers = rnorm(5)
)

b <- data.frame(
  Name = paste0(c("tony", "tom", "bar", "foobar", "company"), ".x"),
  ChaR = LETTERS[11:15]
)

# String consists of 'point letter1 point letter2 point rds'; replace by
# 'letter1 letter2' 
a$Name_stand <- gsub("^\\.([a-z]+)\\.([a-z]+)\\.rds$", "\\1\\2", a$Name)

# String consists of 'letter1 point letter2'; replace by 'letter1 letter2' 
b$Name_stand <- gsub("^([a-z]+)\\.([a-z]+)$", "\\1\\2", b$Name)

result <- merge(a, b, all = TRUE, by = "Name_stand")

Output:

#> result
#  Name_stand        Name.x     Numbers    Name.y ChaR
#1       barx    .bar.x.rds  1.38072696     bar.x    M
#2   companyx          <NA>          NA company.x    O
#3    foobarx .foobar.x.rds -1.53076596  foobar.x    N
#4       foox    .foo.x.rds  1.40829287      <NA> <NA>
#5       tomx    .tom.x.rds -0.01204651     tom.x    L
#6      tonyx   .tony.x.rds  0.34159406    tony.x    K

Another, perhaps somewhat more robust (to variations of the strings such as 'tom.rds' and 'tom' which will still be linked; this can of course also be a disadvantage)/

# Remove the rds from a$Name
a$Name_stand <- gsub("rds$" , "", a$Name)
# Remove all non alpha numeric characters from the strings
a$Name_stand <- gsub("[^[:alnum:]]", "", a$Name_stand)
b$Name_stand <- gsub("[^[:alnum:]]", "", b$Name)

result2 <- merge(a, b, all = TRUE, by = "Name_stand")