user3010126 user3010126 - 4 months ago 25
R Question

Searching for target in Excel spreadsheet using R

As an R noob, I'm currently rather stumped by what is probably a rather trivial problem. I have data that looks like in the second image below, essentially a long sheet of rows with values in three columns. What I need is for a way to scan the sheet looking for particular combinations of values in the first and second column - combinations that are specified in a second spreadsheet of targets (see picture 1). When that particular combination is found, I need the script to extract the whole row in question from the data file.

So far, I've managed to read the files without problem:


folder <- 'C:\\Users\\...\\Desktop\\R EXCEL test\\'

target_file <- paste(folder,(readline(prompt = "Enter filename for target list:")),sep = "")

data_file <- paste(folder,(readline(prompt = "Enter data file:")),sep = "")

targetsDb <- read.xlsx(target_file, sheetName = "Sheet1")
data <- read.xlsx(data_file, sheetName = "Sheet1")

targets <- vector(mode = "list", length = 3)

for(i in 1:nrow(targetsDb)){
targets[[i]] <- c(targetsDb[i,1],targetsDb[i,2])

And with the last command I've managed to save the target combinations as items in a list. However, I run into trouble when it comes to iterating through the file looking for any of those combinations of cell values in the first two columns. My approach was to create a list with one item,

SID_IA <- vector(mode = "list", length = 1)

and to fill it with the values of column 1 and 2 iteratively for each row of the data file:

for(n in 1:nrow(data)){
SID_IA[[n]] <- c(data[n,1],data[n,2])

I would then nest another for loop here, which basically goes through every row in the targets sheet to check if the combination of values currently in the SID_IA list matches any of the target ones. Then at the end of the loop, the list is emptied so it can be filled with the following combination of data values.

for(i in targets){
if(SID_IA[[n]] %in% targets){
print(SID_IA[[n]], "in sentence" , data[n,1], "is ", data[n,3])
SID_IA[[n]] <- NULL

However, if I try to run that last loop, it returns the following output and error:

Error in SID_IA[[n]] : subscript out of bounds
In addition: Warning message:
In if (SID_IA[[n]] %in% targets) { :
the condition has length > 1 and only the first element will be used

So, it seems to be doing something for at least one iteration, but then crashes. I'm sure I'm missing something very elementary, but I just can't see it. Any ideas?

EDIT: As requested, I've removed the images and made the test Excel sheets available here and here.


OK.. I'm attempting an answer that should require minimum use of fancy tricks.

data<- xlsx::read.xlsx(file = "Data.xlsx",sheetIndex = 1)
target<-  xlsx::read.xlsx(file = "Targets.xlsx",sheetIndex = 1)


These values are already in data.frame format. If all you want to know is which rows appear exactly same in data and target, then it will be as simple as finding a merge

merge(target,data,all = F)

If, on the other hand , you want to keep the data table with a marking of target rows, then the easiest way will be to make an index column

data$indx<- 1:nrow(data)
mrg<- merge(target,data,all = F)
data$test<- rep("test", nrow(data))
data$test[mrg$indx]<- "target"


This is like the original image you'd posted.

BTW , if yo are on a graphical interface you can also use File dialogue to open data files.. check out file.choose()