user3354212 user3354212 - 3 months ago 12
R Question

How to count heterogeneous double letters from a list of vectors in R

I have a dataframe:

df = read.table(text="ID V1
1 'TT AA TC GG'
2 'AT GG CC TG AA'
3 'GT AC TT AT'
4 'GC TA CT'
5 'AC'
6 'AA TT CC GG'", header=T, stringsAsFactors=F)


The V1 column contains strings of varying length, made up of homo- or hetero- double letters separated by spaces. I would like to count the number of hetero- double letters in each row.
I used
strsplit(as.character(df$V1), " ")
to convert the column to a list. I know how to do the count for a single character vector, but not for a list. For example, given
A=c("AA","TT","CC","AC","TC")
I can count with
sum(substr(A,1,1) != substr(A,2,2))
The expected result:

df = read.table(text="ID V1 num
1 'TT AA TC GG' 1
2 'AT GG CC TG AA' 2
3 'GT AC TT AT' 3
4 'GC TA CT' 3
5 'AC' 1
6 'AA TT CC GG' 0", header=T, stringsAsFactors=F)


Thank you for your help.

Answer

One option is to split the strings, use substr to extract the 1st and 2nd characters of each pair separately, compare them to get a logical vector, and sum it.

df$num <- vapply(strsplit(df$V1, "\\s+"), function(x)
                        sum(substr(x,1,1)!= substr(x,2,2)), 0)
df$num
#[1] 1 2 3 3 1 0
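
If it helps to see what each step produces, here is a minimal sketch that walks through the same approach on the data from the question. The helper name count_hetero and the intermediate variables are only for illustration and are not part of the original answer.

```r
# Rebuild the example data from the question.
df <- data.frame(
  ID = 1:6,
  V1 = c("TT AA TC GG", "AT GG CC TG AA", "GT AC TT AT",
         "GC TA CT", "AC", "AA TT CC GG"),
  stringsAsFactors = FALSE
)

# Step 1: split each string on whitespace.
# The result is a list with one character vector per row.
pairs <- strsplit(df$V1, "\\s+")
pairs[[1]]
# [1] "TT" "AA" "TC" "GG"

# Step 2: for a single row, compare the 1st and 2nd letter of every pair.
is_hetero <- substr(pairs[[1]], 1, 1) != substr(pairs[[1]], 2, 2)
is_hetero
# [1] FALSE FALSE  TRUE FALSE

# Step 3: apply the same comparison to every list element and sum the TRUEs.
# count_hetero is a hypothetical helper name used for this sketch.
count_hetero <- function(x) sum(substr(x, 1, 1) != substr(x, 2, 2))
num <- vapply(pairs, count_hetero, integer(1))
num
# [1] 1 2 3 3 1 0
```

vapply is used rather than sapply so that the result is guaranteed to be one integer per row, even if the input happens to be empty.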

Or a compact option would be to count the words (\\w+) with str_count after removing all the homogeneous substrings with gsub

library(stringr)
str_count(trimws(gsub("(\\S)\\1+", "", df$V1)), "\\w+")
#[1] 1 2 3 3 1 0

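The trimws call is not strictly needed: \\w+ matches only word characters, so the count should also be correct with leading/trailing spaces left in place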

str_count(gsub("(\\S)\\1+", "", df$V1), "\\w+")
#[1] 1 2 3 3 1 0
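
For anyone who would rather not load stringr, the same idea can be sketched in base R. This is a stand-in for str_count, not the code from the answer: it counts the \\w+ matches with gregexpr and regmatches instead. The padded vector is made-up input, added here only to exercise the leading/trailing space case.

```r
# Hypothetical input with extra leading and trailing spaces.
padded <- c("  TT AA TC GG ", "AT GG CC TG AA  ", " AA TT CC GG")

# Remove every run of a repeated non-space character (the homogeneous pairs).
stripped <- gsub("(\\S)\\1+", "", padded)

# Count the remaining words. regmatches returns character(0) for a string
# with no match, so lengths() gives 0 for rows with no hetero- pairs.
counts <- lengths(regmatches(stripped, gregexpr("\\w+", stripped)))
counts
# [1] 1 2 0
```

Note that the gsub pattern relies on the pairs being separated by whitespace, as they are in the question; it would also strip a repeated letter that straddles two pairs if they were written without a separator.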