Anand Anand - 3 months ago 11
R Question

How to split epochs into year, month, etc

I have a data frame containing many time columns. I want to add columns for each time for year, month, date, etc.

Here is what I have so far:

library(dplyr)
library(lubridate)

times <- c(133456789, 143456789, 144456789)
train2 <- data.frame(sent_time = times, open_time = times)

time_col_names <- c("sent_time", "open_time")
dt_part_names <- c("year", "month", "hour", "wday", "day")

train3 <- as.data.frame(train2)

dummy <- lapply(time_col_names, function(col_name) {
  pct_times <- as.POSIXct(train3[,col_name], origin = "1970-01-01", tz = "GMT")
  lapply(dt_part_names, function(part_name) {
    part_col_name <- paste(col_name, part_name, sep = "_")
    train3[, part_col_name] <- rep(NA, nrow(train3))
    train3[, part_col_name] <- factor(get(part_name)(pct_times))
  })
})


Everything seems to work, except the columns never get created or assigned. The components do get extracted, and the assignment succeeds without error, but train3 does not have any new columns.

I have checked that the assignment works when I call it outside the nested lapply context:

train3[, "x"] <- rep(NA, nrow(train3))


In this case, column x does get created.

Answer

It is often believed that the apply family provides an advantage in terms of performance compared to a for loop. But the most important difference between a for loop and a loop from the *apply() family is that the latter is designed to have no side effects.

The absence of side effects favors the development of clean, well-structured, and concise code. A problem occurs if one wishes to have side effects, which is usually a symptom of a flawed code design.

Here is a simple example to illustrate this:

myvector <- 10:1
sapply(myvector,prod,2)
# [1] 20 18 16 14 12 10  8  6  4  2

It looks correct, right? The sapply() loop has seemingly multiplied the entries of myvector by two (granted, this result could have been achieved more easily, but this is just a simple example to discuss how *apply() works).

Upon inspection, however, one realizes that this operation has not changed myvector at all:

> myvector
# [1] 10  9  8  7  6  5  4  3  2  1

That is because sapply() has no side effect on myvector. In this example the sapply() loop is equivalent to the command print(myvector*2), and not to myvector <- myvector * 2. The *apply() loops return a new object; they do not modify the original one.
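The same behavior can be seen with a data frame, which is closer to the situation in the question. The following is a minimal sketch; the names df, res and nm are made up for illustration. An assignment inside the function passed to lapply() only changes a copy that is local to that function:

```r
df <- data.frame(a = 1:3)

res <- lapply("b", function(nm) {
  df[, nm] <- 0   # creates a modified copy of df that is local to this function
  names(df)       # the local copy has both columns
})

res[[1]]
# [1] "a" "b"
names(df)
# [1] "a"
```

The local copy is discarded when the function returns, so the df in the calling environment keeps its single column.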

If one really wants to change the object within the loop, the superassignment operator <<- is necessary to modify the object outside the scope of the loop. This should almost never be done, and things become quite ugly in this case. For example, the following loop does change myvector:

sapply(seq_along(myvector), function(x) myvector[x] <<- myvector[x]*2)
> myvector
# [1] 20 18 16 14 12 10  8  6  4  2

Coding in R should not look like this. Note that even in this more convoluted case, myvector remains unchanged if the normal assignment operator <- is used instead of <<-. The correct approach is to assign the object returned by *apply() instead of modifying it within the loop.
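As a minimal sketch of that approach, reusing myvector from the example above, the value returned by sapply() is simply assigned back to the name:

```r
myvector <- 10:1

# Assign the object returned by sapply() instead of modifying myvector inside the loop
myvector <- sapply(myvector, prod, 2)
myvector
# [1] 20 18 16 14 12 10  8  6  4  2
```

No <<- is needed, and the function passed to sapply() stays free of side effects.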

In the specific case described by the OP, the variable dummy may contain the desired output if the commands in the loop are correct. But one cannot expect the object train3 to be modified within the loop, because the assignments there only affect a local copy. Changing train3 itself from inside the loop would require the <<- operator.
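One possible way to apply this to the OP's data, without <<-, is to let the loops return the new columns and bind them to the data frame afterwards. This is only a sketch under some assumptions: new_cols and parts are names introduced here, match.fun() is used in place of get() to look up the lubridate accessor, and train3 is built from train2 in a single step at the end.

```r
library(lubridate)

times <- c(133456789, 143456789, 144456789)
train2 <- data.frame(sent_time = times, open_time = times)

time_col_names <- c("sent_time", "open_time")
dt_part_names <- c("year", "month", "hour", "wday", "day")

# Return the new columns from the loops instead of assigning inside them
new_cols <- lapply(time_col_names, function(col_name) {
  pct_times <- as.POSIXct(train2[[col_name]], origin = "1970-01-01", tz = "GMT")
  parts <- lapply(dt_part_names, function(part_name) {
    # look up the accessor (year, month, ...) by name and apply it
    factor(match.fun(part_name)(pct_times))
  })
  # name each factor after its source column and date part
  setNames(parts, paste(col_name, dt_part_names, sep = "_"))
})

# Concatenate the per-column lists into one flat list and bind it to the original data
train3 <- cbind(train2, as.data.frame(do.call(c, new_cols)))
names(train3)
```

Here train3 is created once, from the objects returned by lapply(), rather than being modified as a side effect.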

A quote mentioned in fortunes::fortune(212) possibly summarizes the problem:

Basically R is reluctant to let you shoot yourself in the foot unless you are really determined to do so. -- Bill Venables