Paul Paul - 1 year ago 52
R Question

Aggregate entries in table by subset of column id characters

I am working on a gene expression dataset using R. I am fairly new to coding so please forgive me if I do not describe the problem in adequate detail.

My dataset looks that looks something like this:

GeneID Sample1 Sample2
Slc26a5-001 7 8
Slc26a5-002 1 2
Homer2-001 6 5
Slc26a5-200 8 10

The gene name is the first part of the ID (Slc26a5) and transcript number is denoted by the (-001). I need to find a way to collapse all of the different transcript ids together and sum their respective rows at the same time. The output would look like the following:

GeneID Sample1 Sample2
Slc26a5 16 20
Homer2 6 5

The Aggregate function should work for summing the rows together based on gene ID. However I am stuck because I can not figure out how to refer to the gene id's by only the first part of their name inside of the aggregate function.

Does anyone know how to do this?

Thanks for your help!

Answer Source

The main thing is to remove the tail part of the GeneID column so that you standardize the grouping. This is done below with sub(). Then it's pretty much just standard aggregation. With aggregate(), the following will do it.

aggregate(df[-1], list(GeneID = sub("-.*", "", df$GeneID)), sum)
#    GeneID Sample1 Sample2
# 1  Homer2       6       5
# 2 Slc26a5      16      20

We could also use rowsum() and not unnecessarily convert any data.

rowsum(df[-1], sub("-.*", "", df$GeneID))
#         Sample1 Sample2
# Homer2        6       5
# Slc26a5      16      20


df <- structure(list(GeneID = structure(c(2L, 3L, 1L, 4L), .Label = c("Homer2-001", 
"Slc26a5-001", "Slc26a5-002", "Slc26a5-200"), class = "factor"), 
    Sample1 = c(7L, 1L, 6L, 8L), Sample2 = c(8L, 2L, 5L, 10L)), .Names = c("GeneID", 
"Sample1", "Sample2"), class = "data.frame", row.names = c(NA,