Craig Hamilton Craig Hamilton - 4 months ago 25
R Question

Extract data from multiple columns in R data frame, then searching another

I have a central data frame of information (df3) that I'm trying to subset and add columns to based on data extracted from several columns of another (df2), that itself comes from a subset of a third (df1). I've managed to get so far by searching help and playing around with various functions, but I have reached an impasse. I do hope you can help.

To begin with, the 3dfs are structured as follows:

#df1 - my initial search database
id <- c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8")
yesno <- c("Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No")
city <- c("London", "London", "Paris", "London", "Paris", "New York", "London", "London")
df1 <- cbind(id, yesno, city)
df1 <- as.data.frame(df1)
df1

#df2 - containing the data needed to search df3, but situated across columns
id <- c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8")
twitter <- c("@one","", "@three", "@four", "", "", "@seven", "")
email <- c("", "", "", "add4", "add5","", "add7", "")
mail <- c("", "postcode2", "", "","","","","postcode8")
df2 <- cbind(id, twitter, email, mail)
df2 <- as.data.frame(df2)
df2

#df3 - the central df containing the data I wish to extract
comms <- c("@one", "postcode2", "@three", "@four", "add4", "add5", "six" "@seven", "add7", "postcode2")
target <- c("text1", "text2", "text3", "text4.1", "text4.2", "text5", "text6", "text7.1","text7.2", "text8")
df3 <- cbind(comms,target)
df3 <- as.data.frame(df3)
df3


The commonality between df1 and df2 is found in the id columns. I've so far been able to filter df1 and extract the ids, which I've then used to subset df2.

df_search <- df1 %>%
filter(yesno == "Yes", city == "London")

df_search_ids <- df_search$id

df2_search <- df2 %>%
filter(id %in% df_search_ids)
df2_search

id twitter email mail
1 id1 @one
2 id2 postcode2
3 id4 @four add4
4 id7 @seven add7


My problems are: the common data between df2 and df3 are spread across three different columns of df2 (twitter, email and mail); these columns contain blank cells and other extraneous info (e.g. 'I am not on Twitter'); and finally that some of the entries in df2 (such as id4 and id7 above) have more than one entry in df3.

The solution I am trying to reach is that I would like to extract all instances from the columns twitter, email and mail of df2 based on a match with the ids extracting from df1, so that the extracted info can then be applied to subset df3 and eventually results in a new df(target_res) that looks like this:

id_res <- c("id1", "id2", "id4", "id4", "id7", "id7")
comms_res <- c("@one", "postcode2", "@four", "add4", "@seven", "add7")
target_res <- c("text1", "text2", "text4.1", "text4.2", "text7.1", "text7.2")
result_df <- cbind(id_res, comms_res, target_res)
result_df <- as.data.frame(result_df)
result_df

id_res comms_res target_res
1 id1 @one text1
2 id2 postcode2 text2
3 id4 @four text4.1
4 id4 add4 text4.2
5 id7 @seven text7.1
6 id7 add7 text7.2


This is an action I will be performing a number of times (based on different explorations of df1), so ideally would be replicable.

I hope this is a clear explanation of the issue.

Answer

The key is to use tidyr::gather to gather the twitter:mail columns (from your filtered df2_search) as rows under a new column comms and then filter again to remove the empty "" rows. Your second pipe can then be:

library(dplyr)

result <- df2 %>% filter(id %in% df_search_ids) %>% 
                  gather("source","comms",twitter:mail) %>% 
                  filter(comms != "") %>%
                  inner_join(df3, by="comms") %>% 
                  select(id_res=id,comms_res=comms,target_res=target) %>%
                  arrange(id_res)

The look up for df3 is then an inner_join by comms, which keeps only the rows matched in both data frames. The rest is formatting the output result.

With this you should get with your input:

print(result)
##  id_res comms_res target_res
##1    id1      @one      text1
##2    id2 postcode2      text2
##3    id2 postcode2      text8
##4    id4     @four    text4.1
##5    id4      add4    text4.2
##6    id7    @seven    text7.1
##7    id7      add7    text7.2
##Warning messages:
##1: attributes are not identical across measure variables; they will be dropped 
##2: In inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) :
##  joining character vector and factor, coercing into character vector

The warnings are because your comms are factors instead of characters. See this for a way to get rid of that.

Comments