Trevor Alexander - 1 month ago
R Question

Undocumented error in dcast.data.table

(This was posted previously at the data-table-help mailing list, but it's been a few weeks without comment, and I did a little more to try to debug it.)

I ran into a strange error that an internet search only turns up in the commit log of data.table:

# Error in dcast.data.table(test.table, as.formula(paste(class.col, "+", :
# retFirst must be integer vector the same length as nrow(i)


This came up when running a previously tested, working dcast.data.table expression on a data.table that I had subset by randomly resampling Trial with replacement. The offending section is this:

dcast.data.table(test.table,
Class + Time + Trial ~ Channel,
value.var = "Voltage",
fun.aggregate=identity)


It seems to be choking on near-duplicate rows in the input table (i.e., the error is the same with or without the id column present in the table):

test.table <- structure(list(Trial = c(1169L, 1169L), Sample = c(155L, 155L
), Class = c(1L, 1L), Subject = structure(c(13L, 13L), .Label = c("s01",
"s02", "s03", "s04", "s05", "s06", "s07", "s08", "s09", "s10",
"s11", "s12", "s13"), class = "factor"), Channel = c(1L, 1L),
Voltage = structure(c(-0.992322316444497, -0.992322316444497
), "`scaled:center`" = -6.23438399446429e-16, "`scaled:scale`" = 1),
Time = c(201.149466192171, 201.149466192171), Baseline = c(0.688151312347969,
0.688151312347969), id = 1:2), .Names = c("Trial", "Sample",
"Class", "Subject", "Channel", "Voltage", "Time", "Baseline",
"id"), class = c("data.table", "data.frame"), row.names = c(NA,
-2L), sorted = "id")

test.table
# Trial Sample Class Subject Channel Voltage Time Baseline id
# 1: 1169 155 1 s13 1 -0.9923223 201.1495 0.6881513 1
# 2: 1169 155 1 s13 1 -0.9923223 201.1495 0.6881513 2
dcast.data.table(test.table,
Class + Time + Trial ~ Channel,
value.var = "Voltage",
fun.aggregate=identity)
# Error in dcast.data.table(test.table, Class + Time + Trial ~ Channel, :
# retFirst must be integer vector the same length as nrow(i)


Changing a single value in one of the columns used in the dcast formula (here, Trial in row 2) gets close to the output I am looking for:

test.table[2,Trial:=1170]
dcast.data.table(test.table,
Class + Time + Trial ~ Channel,
value.var = "Voltage",
fun.aggregate=identity)
# Class Time Trial 1
# 1: 1 201.1495 1169 -0.9923223
# 2: 1 201.1495 1170 -0.9923223


What's bothering data.table? I tried changing keys and messing with the order of the formula terms just to see, because I don't understand the error, but that didn't work.

If I replace the function call with regular dcast from reshape2, I get a seemingly unrelated error:

# Error in vapply(indices, fun, .default) : values must be length 0, but FUN(X[[29]]) result is length 1


At this point in my code I don't care if the Trial values are correct, so I could work around this by replacing it in the formula with id, but I'm interested in a more general or robust solution.

Answer

Update: Fixed in commit 1253 of v1.9.3. From NEWS:

  • dcast.data.table provides better error message when fun.aggregate is specified but it returns length != 1. Closes git #693. Thanks to Trevor Alexander for reporting here on SO.

I agree that the error message should be more helpful in understanding the issue and it usually is in data.table. This is just a case I hadn't foreseen.

If you could please file the issue here as a bug, I'll fix it when I've some time.


Your problem, however, looks like a fairly simple RTFM case to me. From ?dcast.data.table:

fun.aggregate - Should the data be aggregated before casting? If the formula doesn't identify single observation for each cell, then aggregation defaults to length with a message.

In the DETAILS section: "... fun.aggregate will have to be used. The aggregating function should take a vector as input and return a single value (or a list of length one) as output." ...

In your example, your formula's LHS results in two identical rows, which means fun.aggregate has to be used. It defaults to length if you don't supply one (as reshape2:::dcast does). But you've used identity, which just returns its input unchanged. So it returns both values of Voltage for that single cell, which the function doesn't like.
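To make the point above concrete, here is a minimal sketch that counts how many rows fall into each output cell and compares the length of what identity and mean return. The toy table and the names dt, cell_sizes and dup_cells are made up for illustration; they only mimic the structure of test.table.

```r
library(data.table)

# Toy data mirroring the question: two rows that share every formula column.
dt <- data.table(
  Class   = c(1L, 1L),
  Time    = c(201.15, 201.15),
  Trial   = c(1169L, 1169L),
  Channel = c(1L, 1L),
  Voltage = c(-0.99, -0.99),
  id      = 1:2
)

# Count rows per output cell, i.e. per combination of the formula columns.
cell_sizes <- dt[, .N, by = .(Class, Time, Trial, Channel)]

# Cells with more than one row are the ones that need a real aggregation.
dup_cells <- cell_sizes[N > 1L]
print(dup_cells)

# identity hands back every value in the cell; mean collapses them to one.
len_identity <- length(identity(dt$Voltage))
len_mean     <- length(mean(dt$Voltage))
```

Any cell listed in dup_cells needs an aggregating function that returns a single value (or a list of length one), which identity does not do here.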

The error message should be something like:

Error: fun.aggregate should return, for each unique group (from formula's LHS), a length 1 vector, but returns length=2 for a group.

Or something of that sort. Feel free to suggest better / clearer error messages.


PS: I don't understand what you mean by near-duplicate.

identical(test.table[1, list(Class, Time, Trial)], 
          test.table[2, list(Class, Time, Trial)])
# [1] TRUE

If you add the id column to the formula (on the RHS in the call below), then you should be able to get the desired result, as you can now uniquely identify the rows...

dcast.data.table(test.table, 
             Class + Time + Trial ~ Channel + id,
             value.var = "Voltage",
             fun.aggregate=identity)

#    Class     Time Trial        1_1        1_2
# 1:     1 201.1495  1169 -0.9923223 -0.9923223

The function only considers the columns given in the formula to find out whether each cell is uniquely identified, not whether your actual input data has unique rows (if that was the confusion).
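If no id column is available, one possible generalisation is to number the repeats within each cell and put that counter on the LHS. This is only a sketch under assumptions: the toy table and the dup_id column are hypothetical, and it calls dcast(), which in recent data.table versions dispatches to the data.table method.

```r
library(data.table)

# Toy data: the same Trial drawn twice, as happens when resampling with replacement.
dt <- data.table(
  Class   = c(1L, 1L),
  Time    = c(201.15, 201.15),
  Trial   = c(1169L, 1169L),
  Channel = c(1L, 1L),
  Voltage = c(-0.99, -0.99)
)

# Number the repeated rows within each would-be output cell (hypothetical column name).
dt[, dup_id := seq_len(.N), by = .(Class, Time, Trial, Channel)]

# With dup_id on the LHS every cell holds exactly one value,
# so no fun.aggregate is needed.
wide <- dcast(dt, Class + Time + Trial + dup_id ~ Channel, value.var = "Voltage")
print(wide)
```

Each duplicated draw then gets its own output row instead of being squeezed into a single cell.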


To answer OP's 2nd comment:

The only way currently to get a result (without error) is if your function returns a list:

dcast.data.table(test.table, 
             Class + Time + Trial ~ Channel,
             value.var = "Voltage",
             fun.aggregate=list)
#    Class     Time Trial                     1
# 1:     1 201.1495  1169 -0.9923223,-0.9923223

Then you can just check if the columns are all of length 1 and if so, unlist.
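The check described above could look roughly like this. It is a sketch, not a tested recipe for every data.table version: the toy table and the names wide and value_cols are made up for illustration, and it uses dcast() rather than calling dcast.data.table directly.

```r
library(data.table)

# Toy data in which every cell happens to hold exactly one value.
dt <- data.table(
  Class   = c(1L, 1L),
  Time    = c(201.15, 201.15),
  Trial   = c(1169L, 1170L),
  Channel = c(1L, 1L),
  Voltage = c(-0.99, -0.98)
)

# fun.aggregate = list always returns a length-one list per cell.
wide <- dcast(dt, Class + Time + Trial ~ Channel,
              value.var = "Voltage", fun.aggregate = list)

# Every column that is not part of the formula LHS is a value column.
value_cols <- setdiff(names(wide), c("Class", "Time", "Trial"))

# Flatten a list column only when each of its cells holds exactly one value.
for (col in value_cols) {
  if (is.list(wide[[col]]) && all(lengths(wide[[col]]) == 1L)) {
    set(wide, j = col, value = unlist(wide[[col]]))
  }
}
print(wide)
```

Columns in which some cell holds several values are left as list columns, so nothing is silently dropped.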