lurodrig - 1 month ago
R Question

Nested foreach loop changing values in a dataframe R

I am trying to convert two nested for loops into two nested foreach loops that change values of a dataframe wherever dates match, because I believe this can speed up the process significantly. Below is an example of my code:

library(foreach) # provides the foreach looping construct
library(doMC) # parallel backend for foreach

# register the parallel backend
registerDoMC(22) # number of CPU cores to use

file_list <- c("a", "b", "c")
ldf <- c(data.frame(Date = c("2016-10-01", "2016-10-02", "2016-10-03", "2016-10-04")),
data.frame(Date = c("2016-10-07", "2016-10-08", "2016-10-09")),
data.frame(Date = c("2016-10-15", "2016-10-16", "2016-10-17", "2016-10-18", "2016-10-19")))

DF <- data.frame(Date = seq(as.POSIXct("2016-10-01", tz = "UTC"), as.POSIXct("2016-10-31", tz = "UTC"), by = 'day'),
A = 0,
B = 0,
C = 0)

DF2 <- DF # DF2 is used to compare my attempt result


for (i in 1:length(file_list))
{
Date <- ldf[[i]]
Date <- as.POSIXct(Date, tz = "UTC")

for (j in 1:length(Date))
{
ROW <- which(DF$Date == Date[j])
DF[ROW,i+1] <- 1
}

}

throwaway <- foreach (i = 1:length(file_list)) %dopar%
{
Date <- ldf[[i]]
Date <- as.POSIXct(Date, tz = "UTC")

foreach (j = 1:length(Date)) %do%
{
ROW <- which(DF2$Date == Date[j])
DF2[ROW,i+1] <- 1
return(NULL)
}
}


file_list is a list of files that I am reading in.

ldf is the variable used to store the files that are read.

These two variables are made up in this example, just to have a reproducible example.

DF is where the changes made by the loops are stored; here it holds the result of the plain for loops.

DF2 is where the result of my foreach attempt is stored.

The output I am looking for is that of DF, but DF2 remains unchanged. I understand foreach loops are designed for their return values, but how can I get the return values to match the locations where the values of the dataframe should change? Those locations are where the dates of each file read in file_list match the dates in the dataframe DF2. If they match, a 1 is placed at that row (Date) and column (file name). Thanks in advance for any help!

Desired output is:

> DF
         Date A B C
1  2016-10-01 1 0 0
2  2016-10-02 1 0 0
3  2016-10-03 1 0 0
4  2016-10-04 1 0 0
5  2016-10-05 0 0 0
6  2016-10-06 0 0 0
7  2016-10-07 0 1 0
8  2016-10-08 0 1 0
9  2016-10-09 0 1 0
10 2016-10-10 0 0 0
11 2016-10-11 0 0 0
12 2016-10-12 0 0 0
13 2016-10-13 0 0 0
14 2016-10-14 0 0 0
15 2016-10-15 0 0 1
16 2016-10-16 0 0 1
17 2016-10-17 0 0 1
18 2016-10-18 0 0 1
19 2016-10-19 0 0 1
20 2016-10-20 0 0 0
21 2016-10-21 0 0 0
22 2016-10-22 0 0 0
23 2016-10-23 0 0 0
24 2016-10-24 0 0 0
25 2016-10-25 0 0 0
26 2016-10-26 0 0 0
27 2016-10-27 0 0 0
28 2016-10-28 0 0 0
29 2016-10-29 0 0 0
30 2016-10-30 0 0 0
31 2016-10-31 0 0 0

Answer

Consider using no loops at all and instead calling Reduce() with merge() across all the data frames in the list. However, you need to set up your data frames and list slightly differently.
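As a rough illustration of what Reduce() with merge() does, here is a minimal sketch on made-up data. The names parts and joined and the key values "d1", "d2", "d3" are placeholders chosen for this example, not part of the original post.

```r
# Three tiny data frames sharing a "Date" key (made-up values for illustration)
parts <- list(data.frame(Date = c("d1", "d2", "d3")),
              data.frame(Date = "d1", A = 1),
              data.frame(Date = "d3", B = 1))

# Reduce() folds the list pairwise, i.e. merge(merge(parts[[1]], parts[[2]]), parts[[3]]).
# all = TRUE keeps every Date (a full outer join) and leaves NA where a frame has no row.
joined <- Reduce(function(x, y) merge(x, y, by = "Date", all = TRUE), parts)

# Replace the NAs introduced by the outer join with zeros
joined[is.na(joined)] <- 0
print(joined)
```

The first data frame plays the role of the full date sequence, so every key survives the merges even when no other frame mentions it.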

First, add the sequential Date data frame as the first element of the list. Then, to each file you read in, add a second column named A, B, or C, with every value equal to one (this can be done in the lapply or for loop used in the read-in process; post that part for a demonstration). Altogether, as shown below, all.equal confirms that the result exactly matches the original DF:

# INITIALIZE LIST WITH DATE SEQUENCE DF
newldf <- list(data.frame(Date = as.factor(seq(as.POSIXct("2016-10-01", tz = "UTC"), 
                                  as.POSIXct("2016-10-31", tz = "UTC"), 
                                  by = 'day'))))

# APPEND LIST OF DATA FRAMES THAT ARE READ IN, EACH WITH SECOND COL = 1
newldf <- append(newldf,
                list(data.frame(Date = c("2016-10-01", "2016-10-02", 
                                         "2016-10-03", "2016-10-04"), A = 1),
                     data.frame(Date = c("2016-10-07", "2016-10-08", 
                                         "2016-10-09"), B = 1),
                     data.frame(Date = c("2016-10-15", "2016-10-16", 
                                         "2016-10-17", "2016-10-18", "2016-10-19"), C=1)))

# MERGE ALL DATA FRAMES TOGETHER
newDF <- Reduce(function(...) merge(..., by=c("Date"), all=TRUE), newldf)
newDF[is.na(newDF)] <- 0                                # CONVERT NAs TO ZEROs
newDF$Date <- as.POSIXct(newDF$Date, tz = "UTC")        # CONVERT DATE TO POSIXct
str(newDF)
# 'data.frame': 31 obs. of  4 variables:
#  $ Date: POSIXct, format: "2016-10-01" "2016-10-02" ...
#  $ A   : num  1 1 1 1 0 0 0 0 0 0 ...
#  $ B   : num  0 0 0 0 0 0 1 1 1 0 ...
#  $ C   : num  0 0 0 0 0 0 0 0 0 0 ...

str(DF)
# 'data.frame': 31 obs. of  4 variables:
#  $ Date: POSIXct, format: "2016-10-01" "2016-10-02" ...
#  $ A   : num  1 1 1 1 0 0 0 0 0 0 ...
#  $ B   : num  0 0 0 0 0 0 1 1 1 0 ...
#  $ C   : num  0 0 0 0 0 0 0 0 0 0 ...

all.equal(DF, newDF)
# [1] TRUE
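
If you would still like to use foreach, one likely reason DF2 stays unchanged is that each iteration (and, with doMC, each forked worker) assigns into its own copy of the data frame, so the change never reaches the original. A possible workaround, sketched below under some assumptions, is to have each iteration return a whole 0/1 column and let .combine = cbind assemble the columns. The names dates, flags and DF3 are made up for this sketch, and ldf is rebuilt as a plain list of date strings so that the block runs on its own.

```r
library(foreach)

# Same toy inputs as in the question, rebuilt here so the block is self-contained
file_list <- c("a", "b", "c")
ldf <- list(c("2016-10-01", "2016-10-02", "2016-10-03", "2016-10-04"),
            c("2016-10-07", "2016-10-08", "2016-10-09"),
            c("2016-10-15", "2016-10-16", "2016-10-17", "2016-10-18", "2016-10-19"))
dates <- seq(as.POSIXct("2016-10-01", tz = "UTC"),
             as.POSIXct("2016-10-31", tz = "UTC"), by = "day")

# Each iteration RETURNS one 0/1 column instead of modifying a shared data frame;
# .combine = cbind binds the returned columns in iteration order.
# %do% runs sequentially; after registerDoMC() it could be swapped for %dopar%.
flags <- foreach(i = seq_along(file_list), .combine = cbind) %do% {
  # vectorised replacement for the inner loop: compare dates as "YYYY-MM-DD" strings
  as.numeric(format(dates, "%Y-%m-%d") %in% ldf[[i]])
}
colnames(flags) <- toupper(file_list)

# DF3 (hypothetical name) has the same layout as DF: Date, A, B, C
DF3 <- data.frame(Date = dates, flags)
str(DF3)
```

Because the inner loop is replaced by a single vectorised %in% comparison, the work per file is small, so it is worth timing both %do% and %dopar% before assuming the parallel version is faster.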