Gin_Salmon Gin_Salmon - 2 months ago 17
R Question

Checking if a substring occurs in a separate column

I need to check whether my data contains names that a benchmark study doesn't.
The problem that I have is that my names are truncated, so are generally a substring of the names reported by the benchmarking study.

For example, a truncated version of my data:

Benchmark `My Data`
Southern Corp Avista
American Electric Power Atmos
Atmos Energy Southern
Delta Natural Gas Co Inc DUET
Avista Corp Duke Energy
Duke Energy Corp American Electric
Pepco Holdings Nextra Energy


For example, you can see that in "My Data", the name is reported as "Southern", but in the Benchmark it is reported as "Southern Corp", similarly, my data will report "Duke Energy" as "Duke Energy Corp".

I need to find out whether my data set contains any names that the benchmarking set doesn't.

So I'd need a flag or a helper column which tells me that my data set includes "Nextra Energy", but the Benchmarking doesn't.

I want the output like:

Benchmark `My Data` Not in Benchmark
Southern Corp Avista 0
American Electric Power Atmos 0
Atmos Energy Southern 0
Delta Natural Gas Co Inc DUET 1
Avista Corp Duke Energy 0
Duke Energy Corp American Electric 0
Pepco Holdings Nextra Energy 1


Thanks!

Answer
dat$flag <- NA
for(i in 1:nrow(dat)){
  if(any(grepl(dat$Data[i],dat$Benchmark))){
    dat$flag[i] <- 0
  } else{
    dat$flag[i] <- 1
  }
}

dat
                 Benchmark              Data flag
1            Southern Corp            Avista    0
2  American Electric Power             Atmos    0
3             Atmos Energy          Southern    0
4 Delta Natural Gas Co Inc              DUET    1
5              Avista Corp       Duke Energy    0
6         Duke Energy Corp American Electric    0
7           Pepco Holdings     Nextra Energy    1