user3354212 - 2 months ago 9
R Question

# How to count heterogeneous double letters from a list of vectors in R

I have a dataframe:

``````
df = read.table(text="ID    V1
1   'TT AA TC GG'
2   'AT GG CC TG AA'
3   'GT AC TT AT'
4   'GC TA CT'
5   'AC'
6   'AA TT CC GG'", header=T, stringsAsFactors=F)
``````

The `V1` column contains strings of varying length. Each string is a series of two-letter pairs separated by spaces, and each pair is either homogeneous (both letters the same, e.g. `AA`) or heterogeneous (two different letters, e.g. `TC`). I would like to count the heterogeneous pairs in each row.

I used `strsplit(as.character(df$V1), " ")` to split each string into a list of vectors. I know how to do the count for a single vector, but not for a list. For example, given `A=c("AA","TT","CC","AC","TC")`, the count is `sum(substr(A,1,1) != substr(A,2,2))`.

The expected result:

``````
df = read.table(text="ID    V1  num
1   'TT AA TC GG'   1
2   'AT GG CC TG AA'    2
3   'GT AC TT AT'   3
4   'GC TA CT'  3
5   'AC'    1
6   'AA TT CC GG'   0", header=T, stringsAsFactors=F)
``````

Thank you for your help.

One option is to split the strings, use `substr` to extract the first and second characters of each element, compare them to get a logical vector, and `sum` it.

``````
# Split each string on whitespace, then count the pairs whose two letters differ
df$num <- vapply(strsplit(df$V1, "\\s+"), function(x)
    sum(substr(x,1,1) != substr(x,2,2)), 0)
df$num
#[1] 1 2 3 3 1 0
``````
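To make the intermediate steps visible, here is a minimal sketch of the same approach broken into pieces. The data frame is rebuilt with `data.frame()` so the block runs on its own, and `count_hetero` is a hypothetical helper name introduced only for illustration. It uses `integer(1)` as the `vapply` template because `sum()` of a logical vector returns an integer.

```r
# Rebuild the example data from the question
df <- data.frame(
  ID = 1:6,
  V1 = c("TT AA TC GG", "AT GG CC TG AA", "GT AC TT AT",
         "GC TA CT", "AC", "AA TT CC GG"),
  stringsAsFactors = FALSE
)

# Step 1: one character vector of pairs per row
pairs <- strsplit(df$V1, "\\s+")
pairs[[1]]
# [1] "TT" "AA" "TC" "GG"

# Step 2: hypothetical helper that counts pairs whose two letters differ
count_hetero <- function(x) sum(substr(x, 1, 1) != substr(x, 2, 2))

# Step 3: apply it to every row; integer(1) tells vapply to expect
# exactly one integer per row
df$num <- vapply(pairs, count_hetero, integer(1))
df$num
# [1] 1 2 3 3 1 0
```

Unlike `sapply`, `vapply` raises an error if any row returns something other than a single integer, which guards against surprises on unusual input.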

Or, more compactly, remove all the homogeneous substrings with `gsub` and then count the remaining words (`\\w+`) with `str_count`:

``````
library(stringr)
str_count(trimws(gsub("(\\S)\\1+", "", df$V1)), "\\w+")
#[1] 1 2 3 3 1 0
``````
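If installing `stringr` is not an option, the same idea can probably be expressed in base R. The sketch below swaps `str_count` for `gregexpr` plus `regmatches` (a substitution, not the method shown above) and prints the intermediate string so the effect of the `gsub` pattern is visible. The vector name `v1` is an assumption standing in for `df$V1`.

```r
# Example strings from the question (stand-in for df$V1)
v1 <- c("TT AA TC GG", "AT GG CC TG AA", "GT AC TT AT",
        "GC TA CT", "AC", "AA TT CC GG")

# Drop every run of a repeated non-space character (AA, TT, ...)
cleaned <- gsub("(\\S)\\1+", "", v1)
cleaned[1]
# [1] "  TC "

# Count the words that are left, one count per string
num <- lengths(regmatches(cleaned, gregexpr("\\w+", cleaned)))
num
# [1] 1 2 3 3 1 0
```

Note that the pattern relies on every token being exactly two letters. A three-letter token such as `AAT` would be reduced to `T` and still counted, whereas the `substr` comparison would not count it.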

The `trimws` call is optional: `\\w+` only matches word characters, so the count is the same when leading or trailing spaces are left in place.

``````
str_count(gsub("(\\S)\\1+", "", df$V1), "\\w+")
#[1] 1 2 3 3 1 0
``````
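For comparison, the `strsplit` approach also appears to tolerate padded input. The quick check below uses a made-up `padded` vector that is not part of the original data. A leading space produces an empty first token, but both `substr` calls return `""` for it, so it is never counted as heterogeneous.

```r
# Hypothetical padded input, not part of the original data
padded <- c("  TT AA TC GG", "AC   ", "  AA TT  ")

# Leading whitespace yields an empty first token; trailing whitespace is dropped
tokens <- strsplit(padded, "\\s+")
tokens[[1]]
# [1] ""   "TT" "AA" "TC" "GG"

# The empty token compares "" with "", so it adds nothing to the count
num <- vapply(tokens,
              function(x) sum(substr(x, 1, 1) != substr(x, 2, 2)),
              integer(1))
num
# [1] 1 1 0
```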