Arihant - 2 months ago
R Question

Match and summarize dataframe in R

I have a data frame with one column per year of time-series data, plus a start year and an end year for each row.

df = data.frame(y2000 = c(12, 636), y2001 = c(234, 76), y2002 = c(3434, 46),
                y2003 = c(36, 35), y2004 = c(6, 64), y2005 = c(56, 65),
                y2006 = c(43, 65), y2007 = c(6, 56), y2008 = c(64, 66),
                y2009 = c(63, 5656), y2010 = c(65, 54),
                startyear = c(2006, 2001), endyear = c(2009, 2005))


For each row, I want to calculate the mean of the values within the start and end years (inclusive), and also the means of the values before the start year and after the end year. The desired output looks like:

y2000 y2001 y2002 y2003 y2004 y2005 y2006 y2007 y2008 y2009 y2010 startyear endyear before_mean within_mean after_mean
   12   234  3434    36     6    56    43     6    64    63    65      2006    2009 629.6666667          44         65
  636    76    46    35    64    65    65    56    66  5656    54      2001    2005         636        57.2     1179.4


I have tried various match and index techniques but can't work out how to do this.

Answer

1. dplyr/tidyr

It may be easier to convert the data from 'wide' to 'long' format first and then use dplyr/tidyr to get the means. Create a row index column ('ind'), reshape the data to 'long' format with gather, and split the 'variable' column into a prefix ('var1') and a numeric year ('var2') with extract. Then group by 'ind' and take the mean of the 'value' column for each of three logical subsets: var2 < startyear, var2 >= startyear & var2 <= endyear, and var2 > endyear.

library(dplyr)
library(tidyr)

dS <-  df %>%
          mutate(ind=row_number()) %>%
          gather(variable, value, starts_with('y')) %>%
          extract(variable, c('var1', 'var2'), '([^0-9]+)([0-9]+)',
                        convert=TRUE) %>%
          group_by(ind) %>%
          summarise(before_mean= mean(value[var2 < startyear]), 
                   within_mean = mean(value[var2 >= startyear & 
                                            var2 <= endyear]),
                   after_mean=mean(value[var2 >endyear])) %>% 
         as.data.frame()

nm1 <-  paste(c('before', 'within', 'after'), 'mean', sep="_")
dS
#   ind before_mean within_mean after_mean
#1   1    629.6667        44.0       65.0
#2   2    636.0000        57.2     1179.4

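We can create additional columns in 'df' from the above output. Select only the three mean columns from 'dS', so that the 'ind' helper column is not assigned by mistake.

df[nm1] <- dS[nm1]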
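As a side note, gather is superseded in current tidyr releases, which recommend pivot_longer for new code. The same long-format idea might be sketched as below. This is a minimal sketch rather than the answer's original code: it assumes tidyr 1.1.0 or later (for names_transform), and the names 'long', 'means', and 'year' are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Example data repeated from the question so the snippet runs on its own
df <- data.frame(y2000 = c(12, 636), y2001 = c(234, 76), y2002 = c(3434, 46),
                 y2003 = c(36, 35), y2004 = c(6, 64), y2005 = c(56, 65),
                 y2006 = c(43, 65), y2007 = c(6, 56), y2008 = c(64, 66),
                 y2009 = c(63, 5656), y2010 = c(65, 54),
                 startyear = c(2006, 2001), endyear = c(2009, 2005))

# Reshape to long format: one row per (original row, year) pair.
# names_prefix strips the leading "y"; names_transform makes the year an integer.
long <- df %>%
  mutate(ind = row_number()) %>%
  pivot_longer(cols = starts_with("y"),
               names_to = "year",
               names_prefix = "y",
               names_transform = list(year = as.integer),
               values_to = "value")

# Mean of the values before, within, and after each row's period
means <- long %>%
  group_by(ind) %>%
  summarise(before_mean = mean(value[year < startyear]),
            within_mean = mean(value[year >= startyear & year <= endyear]),
            after_mean  = mean(value[year > endyear])) %>%
  as.data.frame()

# Attach only the three mean columns, leaving out the 'ind' helper
nm1 <- paste(c('before', 'within', 'after'), 'mean', sep = "_")
df[nm1] <- means[nm1]
```

Here the year is extracted while reshaping, so the separate extract step is not needed.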

2. base R

We can also use base R without changing the format of the dataset. From the original dataset ('df'), create an index ('indx') of the columns whose names contain digits, then strip the non-numeric part of those names and convert the rest to numeric to get the years ('v1').

 indx <- grep('\\d+', names(df))
 v1 <- as.numeric(sub('[^0-9]+', '', names(df)[indx]))
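To make these two helper objects concrete, here is a small check of what they hold for the example data. It is only an illustration: the vector 'nms' is a stand-in introduced here for names(df), rebuilt so the snippet runs on its own.

```r
# Column names as in the question's 'df' ('nms' is a stand-in for names(df))
nms <- c(paste0("y", 2000:2010), "startyear", "endyear")

indx <- grep("\\d+", nms)                        # positions of names containing digits
v1 <- as.numeric(sub("[^0-9]+", "", nms[indx]))  # drop the leading "y", keep the year

indx             # 1 2 3 4 5 6 7 8 9 10 11
v1               # 2000 2001 2002 ... 2010
match(2006, v1)  # 7, i.e. 'y2006' is the 7th year column
```

Because the year columns come first in 'df', a position in 'v1' is also the column position in 'df'. With a different column order, the position would need to be mapped back through 'indx'.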

Loop over the rows of 'df' (lapply). For each row, match the 'startyear' against 'v1' to get the column position ('i1'), take the columns before it, unlist, and calculate the mean ('before_mean'). Match the 'endyear' against 'v1' in the same way to get 'i2'. The columns from 'i1' to 'i2' give the 'within_mean', and the columns after 'i2' give the 'after_mean'. Finally, rbind the list elements and assign the output to the new columns ('nm1') in 'df'.

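df[nm1] <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  i1 <- match(df$startyear[i], v1)  # column position of the start year
  i2 <- match(df$endyear[i], v1)    # column position of the end year
  # Assumes the year columns come first in 'df', and that each row has at least
  # one year before its start year and one year after its end year
  before_mean <- mean(unlist(df[i, 1:(i1 - 1), drop = FALSE]))
  within_mean <- mean(unlist(df[i, i1:i2, drop = FALSE]))
  after_mean  <- mean(unlist(df[i, (i2 + 1):length(v1), drop = FALSE]))
  data.frame(before_mean, within_mean, after_mean)
}))
df[nm1]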
 #    before_mean within_mean after_mean
 #1    629.6667        44.0       65.0
 #2    636.0000        57.2     1179.4
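
The row-wise loop could also be replaced by logical masks over the whole year matrix. The following is a rough sketch of that alternative, not part of the answer above. It assumes there are no missing values and that startyear <= endyear in every row; 'vals', 'before', 'within', 'after', 'mask_mean', and 'res' are names introduced here for illustration.

```r
# Example data repeated from the question so the snippet runs on its own
df <- data.frame(y2000 = c(12, 636), y2001 = c(234, 76), y2002 = c(3434, 46),
                 y2003 = c(36, 35), y2004 = c(6, 64), y2005 = c(56, 65),
                 y2006 = c(43, 65), y2007 = c(6, 56), y2008 = c(64, 66),
                 y2009 = c(63, 5656), y2010 = c(65, 54),
                 startyear = c(2006, 2001), endyear = c(2009, 2005))

indx <- grep("\\d+", names(df))
v1 <- as.numeric(sub("[^0-9]+", "", names(df)[indx]))
vals <- as.matrix(df[indx])  # rows of 'df' by year columns

# Logical matrices with the same shape as 'vals':
# entry [i, j] is TRUE when year j falls in that period for row i
before <- outer(df$startyear, v1, ">")  # year is earlier than the start year
after  <- outer(df$endyear, v1, "<")    # year is later than the end year
within <- !(before | after)             # everything in between, inclusive

# Hypothetical helper: per-row mean of the masked cells (NaN if a period is empty)
mask_mean <- function(mask) rowSums(vals * mask) / rowSums(mask)

res <- data.frame(before_mean = mask_mean(before),
                  within_mean = mask_mean(within),
                  after_mean  = mask_mean(after))
res
```

For the example data this yields the same three means as the output above. Unlike the positional indexing in the loop, it returns NaN rather than a wrong value when a row has no years before its start year or after its end year.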