Tim C. Tim C. - 3 months ago 62
R Question

Number rows per group with RevoScaleR

I'm converting a local R script to make use of the

RevoScaleR
functions in the Revolution-R (aka Microsoft R Client/Server) package. This to be able to scale better with large amounts of data.

The goal is to create a new column that numbers the rows per group. Using
data.table
this would be achieved using the following code:

library(data.table)
eventlog[,ActivityNumber := seq(from=1, to=.N, by=1), by=Case.ID]


For illustration purposes, the output is something like this:

Case.ID ActivityNumber
1 A 1
2 A 2
3 B 1
4 C 1
5 C 2
6 C 3


After some research to do this using the
rx
-functions I found the package
dplyrXdf
, which is basically a wrapper to use
dplyr
functions on
Xdf
stored data, while still benefitting from the optimized functions of
RevoScaleR
(see http://blog.revolutionanalytics.com/2015/10/using-the-dplyrxdf-package.html)

In my case, this would lead to the following:

result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = seq_len(n()))


However, this leads to the following error:

ERROR: Attempting to add a variable without a name to an analysis.
Caught exception in file: CxAnalysis.cpp, line: 3756. ThreadID: 1248 Rethrowing.
Caught exception in file: CxAnalysis.cpp, line: 5249. ThreadID: 1248 Rethrowing.
Error in doTryCatch(return(expr), name, parentenv, handler) :
Error in executing R code: ERROR: Attempting to add a variable without a name to an analysis.


Any ideas how to solve this error? Or other (better?) approaches to get the requested result?

Answer

I'm not sure why this works, but try using seq_along(Case.ID) instead of seq_len(n()):

result <- eventlog %>%
  group_by(Case.ID) %>%
  mutate(ActivityNumber = seq_along(Case.ID))

It seems to be some problem with n(). Here's my exploratory code, in case anyone else wants to experiment:

options(stringsAsFactors = FALSE)

library(dplyrXdf)

# Set up some test data
eventlog_df <- data.frame(Case.ID = c("A", "A", "A", "A", "A", "B", "C", "C", "C"))

# Add a variable for artificially splitting the XDF into small chunks
eventlog_df$Chunk.ID <- factor((seq_len(nrow(eventlog_df)) + 2) %/% 3)

# Check the results
eventlog_df


# Now read it into an XDF file. I'm going to read just three rows in at a time
# so that the XDF file has several chunks, so we can be confident this works
# across chunks

eventlog <- tempfile(fileext = ".xdf")

for(i in 1:3) {
    rxImport(inData = eventlog_df[eventlog_df$Chunk.ID %in% i, ],
             outFile = eventlog,
             colInfo = list(Case.ID = list(type = "factor", 
                                           levels = c("A", "B", "C"))),
             append = file.exists(eventlog))
}

# Convert to a proper data source
eventlog <- RxXdfData(eventlog)

rxGetInfo(eventlog, getVarInfo = TRUE, numRows = 10)


# Now to dplyr. First, let's make sure it can count up the records
# in each group without any trouble.
result <- eventlog %>%
  group_by(Case.ID) %>%
  summarise(ActivityNumber = n())

# It can:
rxDataStep(result)


# Now if we switch to mutate, does n() still work?
result <- eventlog %>%
  group_by(Case.ID) %>%
  mutate(ActivityNumber = n())

# No - and it seems to be complaining about missing variables. So what if
# we try to refer to a variable we *know* exists?
result <- eventlog %>%
  group_by(Case.ID) %>%
  mutate(ActivityNumber = seq_along(Case.ID))

# It works
rxDataStep(result)