Tim C. - 11 months ago
R Question

Number rows per group with RevoScaleR

I'm converting a local R script to make use of the RevoScaleR functions in Revolution R (aka Microsoft R Client/Server), so that it scales better to large amounts of data.

The goal is to create a new column that numbers the rows per group. Using data.table, this would be achieved with the following code:

eventlog[,ActivityNumber := seq(from=1, to=.N, by=1), by=Case.ID]

For illustration purposes, the output is something like this:

Case.ID ActivityNumber
1 A 1
2 A 2
3 B 1
4 C 1
5 C 2
6 C 3
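For readers without data.table at hand, the same target output can be reproduced in base R on a small in-memory data frame. This is only a reference sketch for checking results: the data frame below is made up to match the illustration, and the approach assumes the data fits in memory, so it does not address the XDF scaling problem itself.

```r
# Small in-memory stand-in for the event log (made-up data matching the
# illustration above).
eventlog_df <- data.frame(Case.ID = c("A", "A", "B", "C", "C", "C"),
                          stringsAsFactors = FALSE)

# ave() applies FUN within each Case.ID group and writes the results back in
# the original row order, so every row gets its position within its own group.
eventlog_df$ActivityNumber <- ave(seq_len(nrow(eventlog_df)),
                                  eventlog_df$Case.ID,
                                  FUN = seq_along)

print(eventlog_df)
```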

After some research into doing this with the RevoScaleR functions, I found the dplyrXdf package, which is basically a wrapper for using dplyr functions on XDF-stored data while still benefiting from the optimized functions of RevoScaleR (see http://blog.revolutionanalytics.com/2015/10/using-the-dplyrxdf-package.html).

In my case, this would lead to the following:

result <- eventlog %>%
  group_by(Case.ID) %>%
  mutate(ActivityNumber = seq_len(n()))

However, this leads to the following error:

ERROR: Attempting to add a variable without a name to an analysis.
Caught exception in file: CxAnalysis.cpp, line: 3756. ThreadID: 1248 Rethrowing.
Caught exception in file: CxAnalysis.cpp, line: 5249. ThreadID: 1248 Rethrowing.
Error in doTryCatch(return(expr), name, parentenv, handler) :
Error in executing R code: ERROR: Attempting to add a variable without a name to an analysis.

Any ideas how to solve this error? Or other (better?) approaches to get the requested result?


Answer

I'm not sure why this works, but try using seq_along(Case.ID) instead of seq_len(n()):

result <- eventlog %>%
  group_by(Case.ID) %>%
  mutate(ActivityNumber = seq_along(Case.ID))
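The swap should not change the numbers produced: within one group, counting along the group's own values gives the same sequence as counting up to the group's size. A minimal base R sketch of that equivalence, using split() as a stand-in for the grouping and length() as a stand-in for n() (both are assumptions for illustration, not how dplyrXdf works internally):

```r
# Made-up grouping column.
case_id <- c("A", "A", "B", "C", "C", "C")

# One character vector per group.
groups <- split(case_id, case_id)

# Equivalent of seq_len(n()): count up to the group size.
by_size <- lapply(groups, function(g) seq_len(length(g)))

# Equivalent of seq_along(Case.ID): count along the group's values.
by_along <- lapply(groups, seq_along)

# Both give 1..n for every group.
print(identical(by_size, by_along))
```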

It seems to be some problem with n(). Here's my exploratory code, in case anyone else wants to experiment:

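options(stringsAsFactors = FALSE)

# dplyrXdf provides the dplyr verbs for XDF data; it needs RevoScaleR,
# which ships with Microsoft R
library(dplyr)
library(dplyrXdf)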

# Set up some test data
eventlog_df <- data.frame(Case.ID = c("A", "A", "A", "A", "A", "B", "C", "C", "C"))

# Add a variable for artificially splitting the XDF into small chunks:
# (1:9 + 2) %/% 3 gives 1 1 1 2 2 2 3 3 3, i.e. three rows per chunk
eventlog_df$Chunk.ID <- factor((seq_len(nrow(eventlog_df)) + 2) %/% 3)

# Check the results
print(eventlog_df)

# Now read it into an XDF file. I'm going to read just three rows in at a time
# so that the XDF file has several chunks, so we can be confident this works
# across chunks

eventlog <- tempfile(fileext = ".xdf")

for(i in 1:3) {
    rxImport(inData = eventlog_df[eventlog_df$Chunk.ID %in% i, ],
             outFile = eventlog,
             colInfo = list(Case.ID = list(type = "factor",
                                           levels = c("A", "B", "C"))),
             append = file.exists(eventlog))
}

# Convert to a proper data source
eventlog <- RxXdfData(eventlog)

rxGetInfo(eventlog, getVarInfo = TRUE, numRows = 10)

# Now to dplyr. First, let's make sure it can count up the records
# in each group without any trouble.
result <- eventlog %>%
  group_by(Case.ID) %>%
  summarise(ActivityNumber = n())

# It can.

# Now if we switch to mutate, does n() still work?
result <- eventlog %>%
  group_by(Case.ID) %>%
  mutate(ActivityNumber = n())

# No - and it seems to be complaining about missing variables. So what if
# we try to refer to a variable we *know* exists?
result <- eventlog %>%
  group_by(Case.ID) %>%
  mutate(ActivityNumber = seq_along(Case.ID))

# It works
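To confirm that a result like this really numbers each group 1..n, one option is to bring it into an ordinary data frame and check it in base R. The sketch below uses a hypothetical helper, check_numbering (not part of RevoScaleR or dplyrXdf), and demonstrates it on small made-up data frames. How the XDF result is loaded into memory is left out, and the check only makes sense for results small enough to fit in RAM.

```r
# Hypothetical helper: TRUE if, within every group, the numbering column holds
# exactly 1, 2, ..., n (in any row order), where n is the group's row count.
check_numbering <- function(df, group_col = "Case.ID",
                            number_col = "ActivityNumber") {
  numbers_by_group <- split(df[[number_col]], df[[group_col]], drop = TRUE)
  all(vapply(numbers_by_group,
             function(x) identical(sort(as.integer(x)), seq_along(x)),
             logical(1)))
}

# Made-up examples: one correctly numbered, one with a duplicate in group A.
good <- data.frame(Case.ID = c("A", "A", "B", "C", "C", "C"),
                   ActivityNumber = c(1, 2, 1, 1, 2, 3))
bad  <- data.frame(Case.ID = c("A", "A", "B", "C", "C", "C"),
                   ActivityNumber = c(1, 1, 1, 1, 2, 3))

print(check_numbering(good))
print(check_numbering(bad))
```

Because the check sorts the numbers within each group, it does not depend on the row order of the result, which may differ from the input order after grouping.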