Darren Darren - 10 days ago 5
R Question

R require row sums for occurrences of regex pattern that can occur multiple times in individual cells

I use R, and I'm looking to use regular expressions to calculate row sums of the number of occurrences of a string pattern across all columns of a data frame containing epigenetic information. There are 40 columns, 15 of which may or may not contain the pattern of interest. The code that has got me closest to what I'm looking for is:

# Looking to match following exact pattern ',.,' which will always be
# preceded and followed by a sequence of characters or numbers.
# Note: the full stop in the pattern above signifies any character

df$rowsum <- rowSums(apply(df, 2, grepl, pattern = ".*,.,.*"))


For each row, this provides a count of the columns that contain the pattern. However, the issue I have is that any individual cell can contain this pattern more than once. I've tried several different function combinations to get to the answer, and realise that grep is probably not the solution: it returns a logical whenever it finds the pattern, so it can report at most one match for any particular cell. I need a solution that counts every occurrence of the pattern within each individual cell in a single row, and adds these values to give a row sum total. This total is added to the rowsum column of that particular row.

For context, a typical cell containing a single entry could be:

2212(AATTGCCCCACA,-,0.00)


If there are multiple entries, they appear in the cell as one continuous string, with the entries separated by commas. For example, for two entries:

144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)


I'm using ',.,' as the unique identifier of each entry, as everything else in each entry is variable.
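To make the difference between "is the pattern present?" and "how many times does it occur?" concrete, here is a minimal base R sketch on a single cell. The variable names (cell, has_pattern, match_pos, n_matches) are illustrative only and are not part of the original code.

```r
# Example cell from the question: two entries in one string
cell <- "144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)"

# grepl() only answers "is the pattern present?", so one cell gives one TRUE
has_pattern <- grepl(",.,", cell)

# gregexpr() returns the start position of every match (or -1 if none),
# so counting the positive positions gives the number of occurrences
match_pos <- gregexpr(",.,", cell)[[1]]
n_matches <- sum(match_pos > 0)

has_pattern  # TRUE
n_matches    # 2
```

A cell without the pattern (such as the string "NA") yields -1 from gregexpr(), which this count treats as zero matches.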

Here is some toy data:

df <- data.frame(NAMES = c('A', 'B', 'C', 'D'),
                 GENE1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)", "2(TGTGAGTCAC,+,0.00)", "NA", "NA"),
                 GENE2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),", "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "NA"),
                 stringsAsFactors = FALSE)


The ideal code would produce a data frame with a row sums column attached, with these totals:

# Omitted GENE column contents to save space

NAMES  GENE1  GENE2  rowsum
A      ...    ...    4
B      ...    ...    2
C      ...    ...    1
D      ...    ...    0


Been stumped on this for 48 hrs. Any help would be greatly appreciated.

Answer

We can use str_extract_all from the stringr package

library(stringr)
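# For every column except NAMES, count the entries in each cell by matching
# the numeric ID followed by "(", then add the per-column counts element-wise
df$rowsum <- Reduce(`+`, lapply(df[-1], 
        function(x) lengths(str_extract_all(x, "\\d+\\("))))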
df$rowsum
#[1] 4 2 1 0
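
If adding a package is not an option, the same totals can probably be obtained in base R by counting the ',.,' identifier from the question directly. This is only a sketch under the assumption that missing cells are stored as the string "NA" (as in the toy data) rather than as real NA values; count_matches is a hypothetical helper name, not part of the answer above.

```r
# Toy data from the question
df <- data.frame(
  NAMES = c("A", "B", "C", "D"),
  GENE1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)", "2(TGTGAGTCAC,+,0.00)", "NA", "NA"),
  GENE2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),", "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "NA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of matches of `pattern` in each element of x.
# gregexpr() gives -1 for "no match", so only positive positions are counted.
count_matches <- function(x, pattern = ",.,") {
  vapply(gregexpr(pattern, x), function(m) sum(m > 0), integer(1))
}

# One column of counts per GENE column, then sum across columns for each row
counts <- sapply(df[-1], count_matches)
df$rowsum <- rowSums(counts)
df$rowsum
#[1] 4 2 1 0
```

Real NA values would propagate as NA through this count, so they would need to be handled separately (for example by replacing them with an empty string first).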