bli bli - 13 days ago 6
R Question

Extract and paste together multiple columns of a data frame like object using a vector of column names

I have an object (variable rld) which looks a bit like a "data.frame" (see further down the post for details) in that it has columns that can be accessed using $ or [[]].

I have a vector groups containing names of some of its columns (3 in the example below).

I generate strings based on combinations of elements in the columns as follows:

paste(rld[[groups[1]]], rld[[groups[2]]], rld[[groups[3]]], sep="-")


I would like to generalize this so that I don't need to know how many elements are in groups.

The following attempt fails:

> paste(rld[[groups]], collapse="-")
Error in normalizeDoubleBracketSubscript(i, x, exact = exact, error.if.nomatch = FALSE) :
attempt to extract more than one element
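
The failure above comes down to the difference between [[ (one element at a time) and [ (a vector of names). A minimal sketch on a toy data.frame, which is only an assumed stand-in for the real rld object, so the exact error text differs from the one shown above:

```r
# Toy stand-in for rld: a plain data.frame (an assumption, not the DESeq2 object).
toy <- data.frame(
    genotype = c("WT", "mut"),
    replicate = c("1", "1"),
    stringsAsFactors = FALSE
)
groups <- c("genotype", "replicate")

# `[[` extracts exactly one column and returns it as a vector.
one_column <- toy[["genotype"]]

# `[` accepts a vector of names and returns a list-like subset of columns.
several_columns <- toy[groups]

# `[[` with several names is not a multi-column getter; on this toy object
# it raises an error, which is captured here instead of stopping the script.
double_bracket_result <- tryCatch(toy[[groups]], error = function(e) e)
```

On the toy object, several_columns is still a data.frame, which is what makes it usable as an argument list later on.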


Here is how I would do it in a functional style with a Python dictionary:

map("-".join, zip(*map(rld.get, groups)))


Is there a similar column-getter operator in R?




As suggested in the comments, here is the output of dput(rld): http://paste.ubuntu.com/23528168/ (I could not paste it directly, since it is huge.)

This was generated using the DESeq2 bioinformatics package, and more precisely, by doing something similar to what is described on page 28 of this document: https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf.

DESeq2 can be installed from bioconductor as follows:

source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")


Reproducible example



One of the solutions worked when running in interactive mode, but failed when the code was put in a library function, with the following error:

Error in do.call(function(...) paste(..., sep = "-"), colData(rld)[groups]) :
second argument must be a list


After some tests, it appears that the problem doesn't occur if the function is in the main calling script, as follows:

library(DESeq2)
library(test.package)

lib_names <- c(
    "WT_1",
    "mut_1",
    "WT_2",
    "mut_2",
    "WT_3",
    "mut_3"
)
file_names <- paste(
    lib_names,
    "txt",
    sep="."
)

wt <- "WT"
mut <- "mut"
genotypes <- rep(c(wt, mut), times=3)
replicates <- c(rep("1", times=2), rep("2", times=2), rep("3", times=2))

sample_table <- data.frame(
    lib = lib_names,
    file_name = file_names,
    genotype = genotypes,
    replicate = replicates
)

dds_raw <- DESeqDataSetFromHTSeqCount(
    sampleTable = sample_table,
    directory = ".",
    design = ~ genotype
)

# Remove genes with too few read counts
dds <- dds_raw[ rowSums(counts(dds_raw)) > 1, ]
dds$group <- factor(dds$genotype)
design(dds) <- ~ replicate + group
dds <- DESeq(dds)

test_do_paste <- function(dds) {
    require(DESeq2)
    # All colData columns except the last two
    groups <- head(colnames(colData(dds)), -2)
    rld <- rlog(dds, blind=FALSE)
    stopifnot(all(groups %in% names(colData(rld))))
    combined_names <- do.call(
        function (...) paste(..., sep = "-"),
        colData(rld)[groups]
    )
    print(combined_names)
}

test_do_paste(dds)
# This fails (with the same function put in a package)
#test.package::test_do_paste(dds)


The error occurs when the function is packaged as in https://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/

Data used in the example:



I will post this issue as a separate question.

Although I have an answer to my initial question, I'm still interested in alternative solutions for the "column extraction using a vector of column names" issue.

Answer

We may use either of the following:

do.call(function (...) paste(..., sep = "-"), rld[groups])
do.call(paste, c(rld[groups], sep = "-"))

We can consider a small, reproducible example:

rld <- mtcars[1:5, ]
groups <- names(mtcars)[c(1,3,5,6,8)]
do.call(paste, c(rld[groups], sep = "-"))
#[1] "21-160-3.9-2.62-0"     "21-160-3.9-2.875-0"    "22.8-108-3.85-2.32-1" 
#[4] "21.4-258-3.08-3.215-1" "18.7-360-3.15-3.44-0"
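
To see why the two forms are interchangeable, it may help to look at the argument list that do.call receives. The sketch below reuses the mtcars example; the variable names args, via_c and via_wrapper are only illustrative:

```r
rld <- mtcars[1:5, ]
groups <- names(mtcars)[c(1, 3, 5, 6, 8)]

# c() on a data.frame falls back to list concatenation, so this is a plain
# list of the selected columns with one extra named element, "sep".
args <- c(rld[groups], sep = "-")

# do.call() turns each list element into one argument of paste();
# the element named "sep" is matched to paste's sep parameter.
via_c <- do.call(paste, args)

# Same call, but sep is fixed inside a wrapper instead of appended to the list.
via_wrapper <- do.call(function(...) paste(..., sep = "-"), rld[groups])
```

Because element names become argument names, a column literally called sep or collapse would be matched to the corresponding paste parameter rather than pasted as data.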

Note that it is your responsibility to ensure that all(groups %in% names(rld)) is TRUE; otherwise you get a "subscript out of bounds" or "undefined columns selected" error.
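
One way to make that check explicit is to wrap the call in a small helper that reports which names are missing. This is only a sketch: paste_columns and its error message are hypothetical, not part of any package, and it assumes the object supports names() and [ the way a data.frame does.

```r
# Hypothetical helper (name is illustrative): paste selected columns row-wise,
# failing early with a readable message when a requested column is absent.
paste_columns <- function(x, groups, sep = "-") {
    missing_cols <- setdiff(groups, names(x))
    if (length(missing_cols) > 0) {
        stop("columns not found: ", paste(missing_cols, collapse = ", "))
    }
    do.call(paste, c(x[groups], sep = sep))
}

rld <- mtcars[1:5, ]

# All requested columns exist, so this returns one string per row.
ok <- paste_columns(rld, c("mpg", "cyl"))

# "nope" is not a column; the error message is captured as a string.
bad <- tryCatch(paste_columns(rld, c("mpg", "nope")), error = conditionMessage)
```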


(I am copying your comment as a follow-up)

It seems the methods you propose don't work directly on my object. However, the package I'm using provides a colData function that makes something more similar to a data.frame:

> class(colData(rld))
[1] "DataFrame"
attr(,"package")
[1] "S4Vectors"

do.call(function (...) paste(..., sep = "-"), colData(rld)[groups]) works, but do.call(paste, c(colData(rld)[groups], sep = "-")) fails with an error message I fail to understand (as too often with R...):

> do.call(paste, c(colData(rld)[groups], sep = "-"))
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘mcols’ for signature ‘"character"’
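
Since the question states that rld supports [[ with a single column name, another option is to fetch the columns one at a time and build an ordinary list before calling paste. This is the closest analogue of the Python map/zip snippet, and it avoids calling c() on the S4 object. The sketch below runs on mtcars as a stand-in; that it behaves the same on the real rld is an assumption based on the [[ access described in the question, not something tested here.

```r
# Stand-in for rld (assumption): any object where x[["name"]] returns a column.
rld <- mtcars[1:5, ]
groups <- names(mtcars)[c(1, 3, 5, 6, 8)]

# "Column getter": look up each name with `[[`, one at a time.
# The result is a plain, unnamed list of vectors.
columns <- lapply(groups, function(g) rld[[g]])

# Append sep to the plain list and pass everything to paste(),
# much like "-".join applied over zip(*columns) in Python.
combined <- do.call(paste, c(columns, sep = "-"))
```

Because columns is a base R list, c(columns, sep = "-") uses ordinary list concatenation, and the second argument of do.call is a list as it requires.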