Ronja Ronja - 1 month ago 8
R Question

Combining "aggregate" with a user-made function

I am trying to extract information from a dataset of climatic variables (daily values), spanning multiple years. Presence and absence of snow is coded as 1 and 0 in the data set. Each line corresponds to a day and I have a columns cody for date and half year.
For each half year period I am attempting to find the value of the date after the last occurence of snow presence (1). I tried this by grouping the variables via aggregate, but the use of a custom function did not succeed, since "aggregate" seems to pass the columns of the dataframe without a name.

dates<-c("1993-01-01","1993-01-02","1993-01-03","1993-01-04","1993-01-05","1994-02-20","1994-02-21","1994-02-22","1994-02-23","1994-02-24")
df<-data.frame(Date=as.Date(dates,format = "%Y-%m-%d"),
halves=as.factor(c(1993-01-01,1993-01-01,1993-01-01,1993-01-01,1993-01-01,1994-01-01,1994-01-01,1994-01-01,1994-01-01,1994-01-01)),
plot1=c(1,1,1,0,0,1,1,0,0,0),
plot2=c(1,1,0,0,0,1,1,0,1,0),
plot3=c(0,1,1,1,0,1,1,1,0,0))


I know that using loops is not that efficient, but since I want to avoid applying the function to the "not-plot-columns" I am using:

for(plots in names(df)[- which(names(df) %in% c("Date","halves"))]){
meltday[[plots<-aggregate(df[[plots]]~halves,df,df$Date[last(which(snow.days.half$Date==0)) + 1])
}


This produces a error, since the last part is not evaluated as a function.
So I tried with a home-made function to find the Date+1 of the last occurence of 1 in all plot colums.

snowmelt<-function(x)
{snowmelt<-max(x[[Date]][x[[plots]]==1])
snowmelt}


But then trying

for(plots in names(df)[- which(names(df) %in% c("Date","halves"))]){
meltday[[plots]]<-aggregate(df[[plots]] ~ halves,df,snowmelt)
}


left me with an error statement, since the function coerces its input to a list.

So, I am quite stuck and will be very thankful and apprechiative for any comments or answers that point me into the right direction.
My desired output would be a dataframe with dates in it: along the lines of

>meltday
halves plot1 plot2 plot3
1993-01-04 1993-01-04 1993-01-03 1993-01-05
1994-01-01 1994-02-22 1994-02-24 1994-02-23


EDIT: added desired output for clarification.

Thanks!

Answer

Using tidyr and dplyr

I changed the way you input the halves variable in the OP to use a character vector (I think that might've been a bug)

dates<-c("1993-01-01","1993-01-02","1993-01-03","1993-01-04","1993-01-05","1994-02-20","1994-02-21","1994-02-22","1994-02-23","1994-02-24")
df<-data.frame(Date=as.Date(dates,format = "%Y-%m-%d"),
               halves=as.factor(c('1993-01-01','1993-01-01','1993-01-01','1993-01-01','1993-01-01','1994-01-01','1994-01-01','1994-01-01','1994-01-01','1994-01-01')),
               plot1=c(1,1,1,0,0,1,1,0,0,0),
               plot2=c(1,1,0,0,0,1,1,0,1,0),
               plot3=c(0,1,1,1,0,1,1,1,0,0))

Then I use the gather command to put the data into a long-format, which I then turn into a grouped data-frame for which the summarise function can aggregate for each plot. I put the data back into a wide-format using spread.

df %>%
gather(plot, snow, plot1:plot3) %>%
group_by(plot, halves) %>%
arrange(Date) %>%
summarise(meltday=Date[max(which(snow==T))+1]) %>%
spread(plot, meltday)

The output is

# A tibble: 2 × 4
      halves      plot1      plot2      plot3
*     <fctr>     <date>     <date>     <date>
1 1993-01-01 1993-01-04 1993-01-03 1993-01-05
2 1994-01-01 1994-02-22 1994-02-24 1994-02-23

Note: this will have an error if the meltday hasn't occurred (i.e. there is snow in the final entry). I haven't thought of a neater way to detect the 'last' snow day yet.