Darren Darren - 2 months ago 19
R Question

R require cell counts for number of occurrences of regex pattern over entire data frame

I'm working in R and I have a data frame containing epigenetic information. I have 300,000 rows containing genomic locations and 15 columns each of which identifies a transcription factor motif that may or may not occur at each locus.

I'm trying to use regular expressions to count how many times each transcription factor occurs at each genomic locus. Individual motifs can occur > 15 times at any one locus, so I'd like the output to be a matrix/data frame containing motif counts for each individual cell of the data frame.

A typical single occurrence of a motif in a cell could be:


Whereas if there were multiple occurrences of a motif, these would exist in the cell as a continuous string each entry separated by a comma, for example for two entries:


Here is some toy data:

df <-data.frame(NAMES = c('LOC_A', 'LOC_B', 'LOC_C', 'LOC_D'),
TFM1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)", "2(TGTGAGTCAC,+,0.00)", "0", "0"),
TFM2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),", "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "0"),
stringsAsFactors = F)

I'd be looking for the output in the following format:

LOC_A 2 2
LOC_B 1 1
LOC_C 0 1
LOC_D 0 0

If possible, I'd like to avoid for loops, but if loops are required so be it. To get row counts for this data frame I used the following code, kindly recommended by @akrun:

df$MotifCount <- Reduce(`+`, lapply(df[-1],
function(x) lengths(str_extract_all(x, "\\d+\\("))))

Notice that the unique identifier for the motifs used here is "\\d+\\(" to pick up the number and opening bracket at the start of each motif identification string. This would have to be included in any solution code. Something similar which worked across the whole data frame to provide individual cell counts would be ideal.

Many Thanks


We don't need the Reduce part

data.frame(c(df[1],lapply(df[-1], function(x) lengths(str_extract_all(x, "\\d+\\(")))) )
#1 LOC_A    2    2
#2 LOC_B    1    1
#3 LOC_C    0    1
#4 LOC_D    0    0