Sol Sol - 2 months ago 9
R Question

A clean way to aggregate/groupby involving only the distinct values in each column?

I've been using the dplyr package to create aggregated data tables, for example using the following code:

agg_data <- df %>%
select(calc.method, price1, price2) %>%
group_by(calc.method) %>%
summarize(
count = n(),
mean_price1 = round(mean(price1, na.rm = TRUE),2),
mean_price2 = round(mean(price2, na.rm = TRUE),2))


However, I would like to only calculate the mean over the distinct values of price1 and price2 within groups

e.g:

Price1: 1 1 2 1 2 2 1

Goes to (before aggregation):

Price1: 1 2 1 2 1

(and these in general don't have the same numbers of after removal for price1 and price2). I would also like to calculate a count for each (price1 and price2), counting only distinct values within groups. (Groups are defined as two or more identical values adjacent to each other)

I have tried:

agg_data <- df %>%
select(calc.method, price1, price2) %>%
group_by(calc.method) %>%
summarize(
count = n(),
mean_price1 = round(mean(distinct(price1), na.rm = TRUE),2),
mean_price2 = round(mean(distinct(price2), na.rm = TRUE),2))


And also tried wrapping the columns within the select function with
distinct()
, but both these throw errors.

Is there a way to do this using dplyr or another similar package without having to write something from scratch?

Answer

To satisfy your requirement for distinct, we need to remove successive values that are the same. For numeric vectors, this can be accomplished by:

x <- x[c(1, which(diff(x) != 0)+1)]

The default use of diff computes the difference between adjoining elements in the vector. We use this to detect successive values that are different, for which diff(x) != 0. Since the output differences are lagged by 1, we add 1 to the indices of these distinct elements, and we also want the first element as distinct. For example:

x <- c(1,1,2,1,2,2,1)
x <- x[c(1, which(diff(x) != 0)+1)]
##[1] 1 2 1 2 1

We can then use this with dplyr:

agg_data <- df %>% group_by(calc.method) %>%
                   summarize(count = n(),
                             count_non_rep_1 = length(price1[c(1,which(diff(price1) != 0)+1)]),
                             mean_price1 = round(mean(price1[c(1,which(diff(price1) != 0)+1)], na.rm=TRUE),2),
                             count_non_rep_2 = length(price2[c(1,which(diff(price2) != 0)+1)]),
                             mean_price2 = round(mean(price2[c(1,which(diff(price2) != 0)+1)], na.rm=TRUE),2))

or, better yet, define the function:

remove.repeats <- function(x) {
  x[c(1,which(diff(x) != 0)+1)]
}

and use it with dplyr:

agg_data <- df %>% group_by(calc.method) %>%
                   summarize(count = n(),
                             count_non_rep_1 = length(remove.repeats(price1)),
                             mean_price1 = round(mean(remove.repeats(price1), na.rm=TRUE),2),
                             count_non_rep_2 = length(remove.repeats(price2)),                             
                             mean_price2 = round(mean(remove.repeats(price2), na.rm=TRUE),2))

Using this on some example data that is hopefully similar to yours:

df <- structure(list(calc.method = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), 
price1 = c(1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 3), 
price2 = c(1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1)), 
.Names = c("calc.method", "price1", "price2"), row.names = c(NA, -15L), class = "data.frame")
##   calc.method price1 price2
##1            A      1      1
##2            A      1      1
##3            A      2      1
##4            A      1      1
##5            A      2      1
##6            A      2      1
##7            A      1      1
##8            B      1      2
##9            B      1      1
##10           B      2      2
##11           B      2      1
##12           B      2      2
##13           B      2      1
##14           B      1      2
##15           B      3      1

We get:

print(agg_data)
### A tibble: 2 x 6
##  calc.method count count_non_rep_1 mean_price1 count_non_rep_2 mean_price2
##       <fctr> <int>           <int>       <dbl>           <int>       <dbl>
##1           A     7               5        1.40               1         1.0
##2           B     8               4        1.75               8         1.5
Comments