Marie-Eve - 3 months ago
R Question

Check by row whether each string from a vector is present in a column, and add indicator columns for the result

I would like to flag, row by row, whether each string from a vector of strings is present in a given column of a data frame. Below is a toy data set, followed by the outcome I would like. This can be done with loops, but I would prefer to avoid them if possible, since the real data has about 3 million rows.

mydata <- structure(list(X7 = c("00019", "00019", "00019", "00019", "00035", "00035"), X17 = c("A / BG / C / D / E", "E / D", "B / F", "B / C", "A / BE / G / F", "AB / G" ), n = c(10L, 4L, 4L, 4L, 8L, 4L)), .Names = c("X7", "X17", "n"), row.names = c(NA, -6L), class = c("data.frame"))


> mydata
     X7                X17  n
1 00019 A / BG / C / D / E 10
2 00019              E / D  4
3 00019              B / F  4
4 00019              B / C  4
5 00035     A / BE / G / F  8
6 00035             AB / G  4


In the real outcome the indicator columns can run through the last letter of the alphabet; here I show only a subset.

> outcome
     X7                X17  n A B C D E F G
1 00019 A / BG / C / D / E 10 1 0 1 1 1 0 0
2 00019              E / D  4 0 0 0 1 1 0 0
3 00019              B / F  4 0 1 0 0 0 1 0
4 00019              B / C  4 0 1 1 0 0 0 0
5 00035     A / BE / G / F  8 1 0 0 0 0 1 1
6 00035             AB / G  4 0 0 0 0 0 0 1

lmo
Answer

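Here is one method using sapply and grepl:

outcome2 <- cbind(mydata, sapply(LETTERS[1:7], function(i) as.integer(grepl(paste0("\\b", i, "\\b"), mydata$X17))))

sapply loops through the letters A-G given by LETTERS[1:7]. For each letter, grepl checks whether it is present in each row of mydata$X17, and as.integer converts the logical result (TRUE / FALSE) to a binary integer (1 / 0). The \\b word-boundary anchors restrict the match to whole tokens, so B is not counted in BG or AB, as in the desired outcome; a bare grepl(i, mydata$X17) would match those substrings as well.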
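
One thing worth checking with grepl is how it treats multi-letter tokens such as BG. A minimal check on the first row of the toy data (x is an illustrative name, not part of the original code) compares a bare letter with the same letter wrapped in \\b word boundaries:

```r
# First X17 value from the toy data
x <- "A / BG / C / D / E"

bare_B  <- grepl("B", x)         # TRUE: "B" is found inside the token "BG"
whole_B <- grepl("\\bB\\b", x)   # FALSE: no standalone token "B" in this row
whole_C <- grepl("\\bC\\b", x)   # TRUE: "C" is a token of its own

print(c(bare_B = bare_B, whole_B = whole_B, whole_C = whole_C))
```

In the desired outcome, row 1 has B = 0, which corresponds to the word-boundary form.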

# test that the outcomes are the same
identical(outcome, outcome2)
[1] TRUE
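
Another option, sketched here as an alternative rather than as part of the method above, avoids regular expressions: split each X17 entry into its tokens and test membership with %in%. It assumes the tokens are always separated by exactly " / "; the names tokens, flags and outcome3 are illustrative. It has not been benchmarked here on data of the size mentioned in the question.

```r
# Toy data from the question
mydata <- data.frame(
  X7  = c("00019", "00019", "00019", "00019", "00035", "00035"),
  X17 = c("A / BG / C / D / E", "E / D", "B / F", "B / C",
          "A / BE / G / F", "AB / G"),
  n   = c(10L, 4L, 4L, 4L, 8L, 4L),
  stringsAsFactors = FALSE
)

# Split every row into its tokens; fixed = TRUE treats " / " literally
tokens <- strsplit(mydata$X17, " / ", fixed = TRUE)

# For each letter, flag the rows in which it appears as a whole token
flags <- sapply(LETTERS[1:7], function(l) {
  as.integer(vapply(tokens, function(tk) l %in% tk, logical(1)))
})

# Bind the 0/1 indicator columns (named A..G) to the original data
outcome3 <- cbind(mydata, flags)
print(outcome3)
```

Replacing LETTERS[1:7] with LETTERS would produce one indicator column per letter of the alphabet.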