fugu - 1 year ago 59
R Question

# Perform calculation on each row of a single column from data frame

I have a data frame (`data`):

``````  sample chrom     pos ref alt tri trans decomposed_tri grouped_trans    type    feature       gene
1 1    1  659105   G   A CGT   G>A            ACG           C>T somatic     intron         ds
2 1    1 1227592   A   G CAC   A>G            GTG           T>C somatic     intron    CG42329
3 1    1 1775341   T   G CTG   T>G            CTG           T>G somatic intergenic intergenic
4 1    1 1775552   T   C GTT   T>C            GTT           T>C somatic intergenic intergenic
5 1    1 1812639   T   G GTG   T>G            GTG           T>G somatic intergenic intergenic
6 1    1 1812641   G   A GGA   G>A            TCC           C>T somatic intergenic intergenic
``````

And a list of genes with their lengths (`gene_lengths`):

``````$`128up`
[1] 1553

$`14-3-3epsilon`
[1] 8019

$`14-3-3zeta`
[1] 10010

$`140up`
[1] 1385

$`18SrRNA-Psi:CR41602`
[1] 1974

$`18SrRNA-Psi:CR45861`
[1] 1933
``````

And I want to:

a) Calculate the number of times you would expect to see a gene in this list given the length of the gene (in `gene_lengths`) and the length of the genome (`137547960`)

b) Calculate the number of times we actually see each gene: `hit_genes <- table(data$gene)`

c) Calculate the ratio of observed/expected: `fc <- hit_genes[g] / gene_expect`

d) Return this as a data frame

Here's what I'm doing:

``````snv_count <- nrow(data)        # total number of observations
hit_genes <- table(data$gene)  # the number of times I find each gene in my data
cat("gene", "observed", "expected", "fc", "\n")

for (g in levels(data$gene)) {
  genefraction <- gene_lengths[[g]] / 137547960  # share of the genome covered by this gene
  gene_expect <- snv_count * genefraction        # expected number of hits
  fc <- hit_genes[g] / gene_expect               # observed / expected
  cat(g, hit_genes[g], gene_expect, fc, "\n")
}
``````

``````gene observed expected fc
128up 5 1.493344 3.348189
18SrRNA-Psi:CR45861 3 0.5076489 5.909596
C442219 4 0.03778505 105.862
``````

This works. However, I'm running this inside a function and want to return a data frame. How can I build a data frame row by row in the for loop? I've tried initialising an empty data frame before the loop:

``````df <- data.frame(gene = character(), observed = numeric(), expected = numeric(), fc = numeric())
``````

and then building row by row in the loop:

``````enriched <- rbind(df, data.frame(gene = g, observed = hit_genes[g], expected = gene_expect, fc = fc))
``````

But I get the following error:

``````Error in data.frame(gene = g, observed = hit_genes[g], expected = gene_expect,  :
arguments imply differing number of rows: 1, 0
``````
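The message can be reproduced in isolation, which may help narrow down the cause. The following is a minimal sketch, assuming (this is a guess, not something the post confirms) that at least one gene in `data` has no entry in `gene_lengths`. For a list, `[[` returns `NULL` for a missing name, arithmetic on `NULL` gives a zero-length vector, and `data.frame()` refuses to combine that with length-one columns. The toy `gene_lengths` and the name `"not_in_list"` are hypothetical.

```r
# Toy stand-in for the real list (lengths taken from the post)
gene_lengths <- list(`128up` = 1553, `140up` = 1385)

# Looking up a name that is absent from the list yields NULL ...
missing_len <- gene_lengths[["not_in_list"]]

# ... and arithmetic on NULL yields a zero-length numeric vector
genefraction <- missing_len / 137547960
length(genefraction)

# Building a one-row data frame from it fails; capture the message
res <- tryCatch(
  data.frame(gene = "not_in_list", expected = genefraction),
  error = function(e) conditionMessage(e)
)
res
```

Separately, the attempted loop body assigns to `enriched` but always binds onto the still-empty `df`, so even without the error only the last row would survive. Binding onto `enriched` itself (initialised from `df`), or collecting the rows in a list and binding once at the end, avoids that.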

A further question: should I be using `ddply` to achieve this rather than a loop?

Maybe with `?lapply`. (Untested.)

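``````# 'fun' returns a one-row data frame for gene g
# (defined first so that it exists when lapply() calls it)
fun <- function(g) {
  genefraction <- gene_lengths[[g]] / 137547960
  gene_expect <- snv_count * genefraction
  observed <- as.vector(hit_genes[g])  # drop the table class, keep the count
  fc <- observed / gene_expect
  data.frame(gene = g, observed = observed, expected = gene_expect, fc = fc)
}

# One row per gene, then bind all rows into a single data frame
enriched <- lapply(levels(data$gene), fun)
enriched <- do.call(rbind, enriched)
enriched
``````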

Note that this assumes the objects referenced inside `fun` (`gene_lengths`, `snv_count` and `hit_genes`) are available when it is called, for example in the global environment.
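On the follow-up question about avoiding the loop: each step is plain arithmetic on vectors, so neither a loop nor `ddply` is strictly needed. Below is a minimal base-R sketch of that idea, offered as an alternative rather than as the method above. The function name `gene_enrichment`, its arguments, and the toy inputs are assumptions for illustration. Genes that have no entry in `gene_lengths` are dropped here, which is only one possible policy.

```r
# Hypothetical helper: observed vs. expected hits per gene, vectorised.
# Everything it needs is passed in, so it does not rely on globals.
gene_enrichment <- function(data, gene_lengths, genome_length = 137547960) {
  snv_count <- nrow(data)        # total number of observations
  hit_genes <- table(data$gene)  # observed count per gene

  # Keep only genes that have a known length
  genes <- intersect(names(hit_genes), names(gene_lengths))

  observed <- as.vector(hit_genes[genes])
  expected <- snv_count * unname(unlist(gene_lengths[genes])) / genome_length

  data.frame(
    gene = genes,
    observed = observed,
    expected = expected,
    fc = observed / expected,  # observed / expected
    stringsAsFactors = FALSE
  )
}

# Toy inputs: gene lengths taken from the post, hit counts invented
toy_data <- data.frame(gene = c("128up", "128up", "140up", "unknown_gene"))
toy_lengths <- list(`128up` = 1553, `140up` = 1385)

enriched <- gene_enrichment(toy_data, toy_lengths)
enriched
```

Because the result is built in a single `data.frame()` call, there is no row-by-row `rbind`, and the function can be dropped into a larger function and its return value used directly.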
