jamborta jamborta - 1 month ago 19
R Question

Select multiple columns in data.table

What is the equivalent of selecting multiple columns in data.table, just like this in data.frame?

df <- data.frame(a = 1, b = 2, c = 3)
df[, 2:3]
#   b c
# 1 2 3

Answer

(For recent changes in data.table that remove the need for with=FALSE, currently available only in the development version, see the UPDATE below.)


With data.table versions 1.9.6 and earlier, just set with = FALSE:

library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[, 2:3, with = FALSE]
#    b c
# 1: 2 3
# 2: 3 4

As far as I can tell, the argument is named "with" because it determines whether the column index should be evaluated within the frame of the data.table, as it would be when using, e.g., base R's with() and within().
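To make the analogy with base R's with() concrete, here is a minimal sketch. The data and the variable names sum_base and sum_dt are made up for illustration; the point is only that the default j is evaluated like the expression passed to with().

```r
library(data.table)

df <- data.frame(a = 1:2, b = 2:3, c = 3:4)
dt <- as.data.table(df)

# Base R: with() evaluates the expression using df's columns as variables.
sum_base <- with(df, a + b)

# data.table with the default with = TRUE: j is evaluated the same way,
# so the column names a and b can be used as variables.
sum_dt <- dt[, a + b]

# Both calls compute the same element-wise sum of columns a and b.
identical(sum_base, sum_dt)
```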

From ?data.table:

By default with=TRUE and j is evaluated within the frame of x. The column names can be used as variables.

When with=FALSE j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, and the value returned is always a data.table...
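As a sketch of the forms the documentation lists, the same two columns can be selected by name, by position, or through a variable holding the names. The variable name cols is hypothetical; with = FALSE is spelled out so the calls behave the same in 1.9.6 and in later versions.

```r
library(data.table)

dt <- data.table(a = 1:2, b = 2:3, c = 3:4)

# A character vector of column names.
by_name <- dt[, c("b", "c"), with = FALSE]

# A numeric vector of column positions (here of the form startcol:endcol).
by_pos <- dt[, 2:3, with = FALSE]

# Column names stored in a variable (cols is a hypothetical name).
cols <- c("b", "c")
by_var <- dt[, cols, with = FALSE]

# Even a single column comes back as a data.table, not a vector.
one_col <- dt[, "b", with = FALSE]
```

Selecting by name is the more robust of the three, since it keeps working if columns are later added or reordered.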

There is some related reasoning in ?setkey:

It isn't good programming practice, in general, to use column numbers rather than names. [...] If you use column numbers then bugs (possibly silent) can more easily creep into your code as time progresses if changes are made elsewhere in your code; e.g., if you add, remove or reorder columns in a few months time, a setkey [or a select] by column number will then refer to a different column, possibly returning incorrect results with no warning. (A similar concept exists in SQL where "select * from ..." is considered poor programming style [by some] when a robust, maintainable system is required.) If you really wish to use column numbers, it's possible but deliberately a little harder; e.g., setkeyv(DT,colnames(DT)[1:2]) [or setting with=FALSE in selects].
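The advice quoted from ?setkey can be sketched as follows. The table and its column names (id, grp, val) are invented for the example; setkeyv(), key() and colnames() are the only calls used.

```r
library(data.table)

DT1 <- data.table(id = c(2L, 1L), grp = c("x", "y"), val = c(10, 20))
DT2 <- copy(DT1)

# Preferred: refer to the key columns by name. This keeps working
# if columns are later added, removed or reordered.
setkeyv(DT1, c("id", "grp"))

# By position: possible, but deliberately a little harder, because the
# positions have to be translated into names first.
setkeyv(DT2, colnames(DT2)[1:2])

# Both tables end up keyed (and therefore sorted) by id, then grp.
key(DT1)
key(DT2)
```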


UPDATE: 2016-10-18

The current development version of data.table (v1.9.7; installation instructions here) now implements a column selection syntax that is more consistent with the rest of data.table. It will ship with future stable versions distributed on CRAN, starting with v1.9.8.

Now, without explicitly setting with=FALSE, any of the following calls will work just as you'd hope/expect them to:

dt <- data.table(a = 1, b = 2, c = 3)
dt[, 2]
#    b
# 1: 2
dt[, 2:3]
#    b c
# 1: 2 3
dt[, "a"]
#    a
# 1: 1
dt[, c("a","b")]
#    a b
# 1: 1 2

The relevant NEWS entry describes this and another related change:

When j contains no unquoted variable names (whether column names or not), with= is now automatically set to FALSE. Thus, DT[,1], DT[,"someCol"], DT[,c("colA","colB")] and DT[,100:109] now work as we all expect them to; i.e., returning columns, #1188, #1149. Since there are no variable names there is no ambiguity as to what was intended. DT[,colName1:colName2] no longer needs with=FALSE either since that is also unambiguous; it's a single call to the : function so with=TRUE could make no sense, despite the presence of unquoted variable names. These changes can be made since nobody can be using the existing behaviour of returning back the literal j value since that can never be useful. This provides a new ability and should not break any existing code.

Selecting a single column still returns a 1-column data.table (not a vector, unlike data.frame by default) for type consistency for code (e.g. within DT[...][...] chains) that can sometimes select several columns and sometime one, as has always been the case in data.table and we have no intention to bring back drop.

In future, DT[,myCols] (i.e. a single variable name) will look for myCols in calling scope without needing to set with=FALSE too, just as a single symbol appearing in i does already. The new behaviour can be turned on now by setting the option: options(datatable.WhenJisSymbolThenCallingScope=TRUE). The default is currently FALSE to give you time to change your code. In this future state, one way (i.e. DT[,theColName]) to select the column as a vector rather than a 1-column data.table will no longer work leaving the two other ways that have always worked remaining (since data.table is still just a list after all): DT[["someCol"]] and DT$someCol. Those base R methods are faster too (when iterated many times) by avoiding the small argument checking overhead inside the more flexible DT[...] syntax as has been highlighted in example(data.table) for many years.

In the next release, DT[,someCol] will continue with old current behaviour but start to warn if the new option is not set. Then the default will change to TRUE to nudge you to move forward whilst still retaining a way for you to restore old behaviour for this feature only, whilst still allowing you to benefit from other new features of the latest release without changing your code. Then finally after an estimated 2 years from now, the option will be removed.
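A short sketch of the forms the NEWS entry says are unambiguous or have always worked. The table is made up, and the a:b range selection assumes data.table v1.9.8 or later, as described in the entry above.

```r
library(data.table)

DT <- data.table(a = 1:2, b = 2:3, c = 3:4)

# Two ways to extract one column as a plain vector; these rely only on
# data.table being a list, so they do not depend on the with= rules.
v1 <- DT[["b"]]
v2 <- DT$b

# Selecting one column by name returns a 1-column data.table, not a vector.
# with = FALSE is kept explicit so this also runs on 1.9.6 and earlier.
one_col <- DT[, "b", with = FALSE]

# Column names held in a variable (myCols is a hypothetical name):
# an explicit with = FALSE leaves no doubt that myCols is a selector.
myCols <- c("a", "c")
picked <- DT[, myCols, with = FALSE]

# Assumes v1.9.8 or later: a range of unquoted column names selects columns.
rng <- DT[, a:b]
```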