K.Terr - 16 days ago
R Question

How to create a new column showing if and how many variables share a specific observation

I have a question concerning the analysis of some bioinformatics data in R.

My test data frame has a variable "Sequence" whose observations are letter codes, plus three variables (P1, P2, P3), one per individual/sample, that give how often each sequence was counted in that individual (for example, P3 contains the sequence "AB" 23 times).
I now want to fill a new column (already present in my data frame as the dummy column X, filled with NA) that shows, for each sequence row, whether the sequence is shared between individuals (P1, P2, P3) and, more importantly, how many of the three individuals contain it. The values in the new column can therefore only range from 1 to 3. For example, for sequence "ABCDE" the new column would show 1 because the sequence occurs only in individual P3; for "ABC" it would show 2 because the sequence occurs in both P2 and P3; and for "ABCD" it would show 3 since all three individuals contain it.

My test data looks like this after dput():

structure(list(Sequence = structure(1:9, .Label = c("AB", "ABC",
"ABCD", "ABCDE", "ABCDEF", "ABCDEFG", "ABCDEFGH", "ABCDEFGHI",
"ABCDEFGHIJ"), class = "factor"), P1 = c(5L, 0L, 20L, 0L, 3L,
1L, 0L, 0L, 0L), P2 = c(6L, 2L, 3L, 0L, 2L, 0L, 56L, 10L, 3L),
P3 = c(23L, 34L, 8L, 5L, 0L, 6L, 0L, 78L, 5L), X = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Sequence",
"P1", "P2", "P3", "X"), class = "data.frame", row.names = c(NA,
-9L))


Thank you!

Answer

You can count, for each row, how many of the "P" columns have a positive count:

mydf$X <- rowSums(mydf[, grep("^P", names(mydf))] > 0)

head(mydf, 4)
#  Sequence P1 P2 P3 X
#1       AB  5  6 23 3
#2      ABC  0  2 34 2
#3     ABCD 20  3  8 3
#4    ABCDE  0  0  5 1
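
To see why this works, the one-liner can be split into its steps. The following is only a sketch: it rebuilds the question's test data by hand, and the names sample_cols and present, as well as the stricter pattern "^P[0-9]+$", are illustrative assumptions rather than part of the original answer.

```r
# Rebuild the question's test data (same values as the dput() output).
mydf <- data.frame(
  Sequence = c("AB", "ABC", "ABCD", "ABCDE", "ABCDEF", "ABCDEFG",
               "ABCDEFGH", "ABCDEFGHI", "ABCDEFGHIJ"),
  P1 = c(5L, 0L, 20L, 0L, 3L, 1L, 0L, 0L, 0L),
  P2 = c(6L, 2L, 3L, 0L, 2L, 0L, 56L, 10L, 3L),
  P3 = c(23L, 34L, 8L, 5L, 0L, 6L, 0L, 78L, 5L)
)

# Step 1: select the sample columns by name.
# Assumption: "^P[0-9]+$" is used here instead of "^P" so that an unrelated
# column whose name merely starts with "P" would not be picked up.
sample_cols <- grep("^P[0-9]+$", names(mydf), value = TRUE)

# Step 2: comparing the counts with 0 gives a logical matrix,
# TRUE where the sequence occurs in that individual.
present <- mydf[, sample_cols, drop = FALSE] > 0

# Step 3: rowSums() treats TRUE as 1 and FALSE as 0, so the row sum is
# the number of individuals that contain the sequence.
mydf$X <- rowSums(present)

print(mydf)
```

If the count columns could contain NA, rowSums(present, na.rm = TRUE) would treat a missing count as "not present" instead of returning NA for that row; whether that is the right interpretation depends on the data.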