Peter Chen Peter Chen - 1 year ago 36
R Question

change date to the end day of the month by data.table

I have 4190145 observations.

And I want to change my date to the end day of the month. I exlain in the following.

Here is my part of data:

Time1
2015/01/15
2015/02/24
2015/07/18
2015/11/10
2016/02/20
2016/04/26
2016/08/17


I want to create new column
Time2
:

Time1 Time2
2015/01/15 2015/01/31
2015/02/24 2015/02/28
2015/07/18 2015/07/31
2015/11/10 2015/11/30
2016/02/20 2016/02/29
2016/04/26 2016/04/30
2016/08/17 2016/08/31


The code is:

data[, Time2 := Time1]
day(data$Time2) <- days_in_month(data$Time1)


However, I got the error.

Error: cannot allocate vector of size N Mb


Therefore, I search my problem on stackoverflow and find this.

I use
gc()
but still not work. So I see my
sessionInfo()
:

sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] lubridate_1.6.0 data.table_1.10.4

loaded via a namespace (and not attached):
[1] lazyeval_0.2.0 R6_2.2.0 assertthat_0.1 magrittr_1.5 DBI_0.5-1 tools_3.3.3 dplyr_0.5.0 tibble_1.2
[9] Rcpp_0.12.9 stringi_1.1.2 stringr_1.1.0


I think it is because my computer just 32-bit. And
memory.limit()
can just set size to 4000 because my RAM is 4 GB.

Moreover, I figure out that if I use
data.table
to execute, I can run.

So, my problem is how to change my code.
day(data$Time2) <- days_in_month(data$Time1)
to
data.table
form.

Maybe like
data[, day(data$Time2) := days_in_month(data$Time1)]


I know this is incorrect because I get
Error
:
Can't assign to the same column twice in the same query (duplicates detected).


Any suggestion?

Answer Source

Perhaps this works?

set.seed(120340)
NN = 5e6
DT = data.table(Time1 = 
                  sprintf('%04d/%02d/%02d',
                          sample(2000:2017, NN, TRUE),
                          sample(12, NN, TRUE),
                          sample(28, NN, TRUE)))

# potential memory bottleneck
DT[ , c('y', 'm', 'd') := tstrsplit(Time1, '/')]

days_month = data.table(
  month = sprintf('%02d', 1:12),
  days = c(31L, 28L, 31L, 30L, 31L, 30L, 
           31L, 31L, 30L, 31L, 30L, 31L)
)

DT[days_month, d_end := i.days, on = c(m = 'month')]
DT[m == 2L & as.integer(y) %% 4L == 0L, d_end := 29L]
DT[ , Time2 := do.call(paste, c(.SD, list(sep = '/'))), 
    .SDcols = c('y', 'm', 'd_end')]

If that fails, I guess this will be less memory-intensive:

DT[ , y := gsub('/.*', '', Time1)]
DT[ , c('m', 'd') := tstrsplit(Time1, '/')[2L:3L], by = y]

And if that fails, I suggest buying more RAM, or working with a subset of your data before deploying to a more serious machine. Also, as Frank points out, you really want to be storing these as IDates for maximal memory efficiency.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download