ovon ovon - 1 month ago 12
R Question

Iterative subtraction based on factors in a data frame using R

I'm struggling to come up with a working solution to what seems like a fairly simple problem. I have a data frame with both data and factors in it, and I'd like to use the factors to decide which data points need to be subtracted from other data points to produce a new data frame of comparative values.

Here's what the data frame is like:

str(means)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 32 obs. of 5 variables:
$ rat : Factor w/ 8 levels "Rat1","Rat2",..: 1 1 1 1 2 2 2 2 3 3 ...
$ gene : Factor w/ 4 levels "gene1","gene2",..: 1 2 3 4 1 2 3 4 1 2 ...
$ gene_category: Factor w/ 2 levels "control","experimental": 2 2 1 1 2 2 1 1 2 2 ...
$ timepoint1 : num 23.4 18.3 42.1 40.1 25.3 ...
$ timepoint2 : num 23.5 18.4 41.5 39.9 22.8 ...
> head(means)
Source: local data frame [6 x 5]
Groups: rat, gene [6]

rat gene gene_category timepoint1 timepoint2
(fctr) (fctr) (fctr) (dbl) (dbl)
1 Rat1 gene1 experimental 23.36667 23.49667
2 Rat1 gene2 experimental 18.26000 18.38000
3 Rat1 gene3 control 42.05500 41.45000
4 Rat1 gene4 control 40.08667 39.89500
5 Rat2 gene1 experimental 25.29333 22.83000
6 Rat2 gene2 experimental 19.72667 19.19333


For each rat (8 rats in total), I'd like to subtract the 'control' gene values (genes 3 and 4) from the 'experimental' gene values (genes 1 and 2). I need to do this iteratively, so each experimental gene value must have each control gene value subtracted from it (within each rat, but not between rats). The above should be done for each timepoint column.

I've been fiddling with a solution using dplyr, I've got the grouping down but I can't figure out how to do the rest:

diffs <- means %>% group_by(rat, gene, gene_category) %>% here_is_where_i_don't_know_what_to_do)


There is a solution here to a similar problem here but I think it will give me every pairwise operation possible, and that's not what I'm looking for. It also only deals with two factors, while I have three I need to consider.

Here's another solution to a similar problem, but again there are some things about it that make it less than ideal. It deals with one factor only and I'm not sure how it could be applied to a dataset with three factors and two data vectors.

I know that this problem is solved when doing something like a pairwise comparison to determine statistical significance (multiple t-tests, ANOVA, MANOVA, etc), but the packages/base stat functions I'm familiar with that do these tests keep this basic operation under the hood. I'd like a simple solution that uses as few loops as possible with either base R or dplyr/plyr/reshape2, etc.

Answer

I think the solution will involve generating the comparisons you want, then passing them to a standard evaluation mutate_ instead of fighting with group_by and summarize.

First, here are the data read in (note, added genes 3/4 for rat2):

means <-
  read.table(text =
" rat   gene gene_category timepoint1 timepoint2
1   Rat1  gene1  experimental   23.36667   23.49667
2   Rat1  gene2  experimental   18.26000   18.38000
3   Rat1  gene3       control   42.05500   41.45000
4   Rat1  gene4       control   40.08667   39.89500
5   Rat2  gene1  experimental   25.29333   22.83000
6   Rat2  gene2  experimental   19.72667   19.19333
7   Rat2  gene3       control   42.05500   41.45000
8   Rat2  gene4       control   40.08667   39.89500")

Next, generate the set of genes within each class:

geneLists <-
  means %>%
  {split(.$gene, .$`gene_category`)} %>%
  lapply(unique) %>%
  lapply(as.character) %>%
  lapply(function(x){paste0("`", x, "`")})

Note that that backticks "`" are to protect against potentially invalid column names (e.g., things with spaces). This gives:

$control
[1] "`gene3`" "`gene4`"

$experimental
[1] "`gene1`" "`gene2`"

Then, paste together the comparisons you want:

colsToCreate <-
  outer(geneLists[["experimental"]]
        , geneLists[["control"]]
        , paste, sep = " - ") %>%
  as.character()

Giving:

[1] "`gene1` - `gene3`" "`gene2` - `gene3`" "`gene1` - `gene4`" "`gene2` - `gene4`"

Then, use tidyr to spread the data, generating one row per rat. Note, if you want to spread both timepoint1 and timepoint2, you would likely need to gather first (put both times in one column), then create an id column with both time and gene, then spread using that single id column. This would also require changes to the colsToCreate construction.

After spreading, pass the vector of columns to generate, and you should have what you want:

means %>%
  select(rat, gene, timepoint1) %>%
  spread(gene, timepoint1) %>%
  mutate_(.dots = colsToCreate)

Voila:

   rat    gene1    gene2  gene3    gene4 gene1 - gene3 gene2 - gene3 gene1 - gene4 gene2 - gene4
1 Rat1 23.36667 18.26000 42.055 40.08667     -18.68833     -23.79500     -16.72000     -21.82667
2 Rat2 25.29333 19.72667 42.055 40.08667     -16.76167     -22.32833     -14.79334     -20.36000

Actually, getting both timepoints is even easier than I had thought it would be:

means %>%
  select(-gene_category) %>%
  gather("timepoint", "value", starts_with("timepoint")) %>%
  spread(gene, value) %>%
  mutate_(.dots = colsToCreate)

gives:

   rat  timepoint    gene1    gene2  gene3    gene4 gene1 - gene3 gene2 - gene3 gene1 - gene4 gene2 - gene4
1 Rat1 timepoint1 23.36667 18.26000 42.055 40.08667     -18.68833     -23.79500     -16.72000     -21.82667
2 Rat1 timepoint2 23.49667 18.38000 41.450 39.89500     -17.95333     -23.07000     -16.39833     -21.51500
3 Rat2 timepoint1 25.29333 19.72667 42.055 40.08667     -16.76167     -22.32833     -14.79334     -20.36000
4 Rat2 timepoint2 22.83000 19.19333 41.450 39.89500     -18.62000     -22.25667     -17.06500     -20.70167

Note also that you can name the vector that holds the column calculating formulas, e.g.,:

colsToCreate2 <-
  setNames(colsToCreate
           , c("nameA", "nameB", "nameC", "nameD"))

means %>%
  select(rat, gene, timepoint1) %>%
  spread(gene, timepoint1) %>%
  mutate_(.dots = colsToCreate2)

gives:

   rat    gene1    gene2  gene3    gene4     nameA     nameB     nameC     nameD
1 Rat1 23.36667 18.26000 42.055 40.08667 -18.68833 -23.79500 -16.72000 -21.82667
2 Rat2 25.29333 19.72667 42.055 40.08667 -16.76167 -22.32833 -14.79334 -20.36000

I'm not sure why, but this question excites me enough that I wanted to complete the idea. Here, I gather the comparisons back into long form, then mutate the timepoint into a number using parse_number from readr and separate out the compared genes into separate columns to allow efficient access and filtering. Do note that the repeated use of each gene removes assumptions of independence so do not do stats on these without a lot of very careful thinking about control.

longForm <-
  means %>%
  select(-gene_category) %>%
  gather("timepoint", "value", starts_with("timepoint")) %>%
  spread(gene, value) %>%
  mutate_(.dots = colsToCreate) %>%
  select_(.dots = paste0("-",unlist(geneLists))) %>%
  gather(Comparison, Difference, -rat, -timepoint) %>%
  mutate(time = parse_number(timepoint)) %>%
  separate(Comparison, c("exp_Gene", "cont_Gene"), " - ")

head(longForm)

gives

   rat  timepoint exp_Gene cont_Gene Difference time
1 Rat1 timepoint1    gene1     gene3  -18.68833    1
2 Rat1 timepoint2    gene1     gene3  -17.95333    2
3 Rat2 timepoint1    gene1     gene3  -16.76167    1
4 Rat2 timepoint2    gene1     gene3  -18.62000    2
5 Rat1 timepoint1    gene2     gene3  -23.79500    1
6 Rat1 timepoint2    gene2     gene3  -23.07000    2

Then, we can plot the results:

longForm %>%
  ggplot(aes(x = time
             , y = Difference
             , col = rat)) +
  geom_line() +
  facet_grid(exp_Gene ~ cont_Gene)

enter image description here