Sol - 6 months ago 38

R Question

I've been using the dplyr package to create aggregated data tables, for example using the following code:

`agg_data <- df %>%`

select(calc.method, price1, price2) %>%

group_by(calc.method) %>%

summarize(

count = n(),

mean_price1 = round(mean(price1, na.rm = TRUE),2),

mean_price2 = round(mean(price2, na.rm = TRUE),2))

However, I would like to only calculate the mean over the distinct values of price1 and price2 within groups

e.g:

Price1: 1 1 2 1 2 2 1

Goes to (before aggregation):

Price1: 1 2 1 2 1

(and these in general don't have the same numbers of after removal for price1 and price2). I would also like to calculate a count for each (price1 and price2), counting only distinct values within groups. (Groups are defined as two or more identical values adjacent to each other)

I have tried:

`agg_data <- df %>%`

select(calc.method, price1, price2) %>%

group_by(calc.method) %>%

summarize(

count = n(),

mean_price1 = round(mean(distinct(price1), na.rm = TRUE),2),

mean_price2 = round(mean(distinct(price2), na.rm = TRUE),2))

And also tried wrapping the columns within the select function with

`distinct()`

Is there a way to do this using dplyr or another similar package without having to write something from scratch?

Answer

To satisfy your requirement for *distinct*, we need to remove successive values that are the same. For numeric vectors, this can be accomplished by:

```
x <- x[c(1, which(diff(x) != 0)+1)]
```

The default use of `diff`

computes the difference between adjoining elements in the vector. We use this to detect successive values that are different, for which `diff(x) != 0`

. Since the output differences are lagged by `1`

, we add `1`

to the indices of these *distinct* elements, and we also want the first element as distinct. For example:

```
x <- c(1,1,2,1,2,2,1)
x <- x[c(1, which(diff(x) != 0)+1)]
##[1] 1 2 1 2 1
```

We can then use this with `dplyr`

:

```
agg_data <- df %>% group_by(calc.method) %>%
summarize(count = n(),
count_non_rep_1 = length(price1[c(1,which(diff(price1) != 0)+1)]),
mean_price1 = round(mean(price1[c(1,which(diff(price1) != 0)+1)], na.rm=TRUE),2),
count_non_rep_2 = length(price2[c(1,which(diff(price2) != 0)+1)]),
mean_price2 = round(mean(price2[c(1,which(diff(price2) != 0)+1)], na.rm=TRUE),2))
```

or, better yet, define the function:

```
remove.repeats <- function(x) {
x[c(1,which(diff(x) != 0)+1)]
}
```

and use it with `dplyr`

:

```
agg_data <- df %>% group_by(calc.method) %>%
summarize(count = n(),
count_non_rep_1 = length(remove.repeats(price1)),
mean_price1 = round(mean(remove.repeats(price1), na.rm=TRUE),2),
count_non_rep_2 = length(remove.repeats(price2)),
mean_price2 = round(mean(remove.repeats(price2), na.rm=TRUE),2))
```

Using this on some example data that is hopefully similar to yours:

```
df <- structure(list(calc.method = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
price1 = c(1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 3),
price2 = c(1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1)),
.Names = c("calc.method", "price1", "price2"), row.names = c(NA, -15L), class = "data.frame")
## calc.method price1 price2
##1 A 1 1
##2 A 1 1
##3 A 2 1
##4 A 1 1
##5 A 2 1
##6 A 2 1
##7 A 1 1
##8 B 1 2
##9 B 1 1
##10 B 2 2
##11 B 2 1
##12 B 2 2
##13 B 2 1
##14 B 1 2
##15 B 3 1
```

We get:

```
print(agg_data)
### A tibble: 2 x 6
## calc.method count count_non_rep_1 mean_price1 count_non_rep_2 mean_price2
## <fctr> <int> <int> <dbl> <int> <dbl>
##1 A 7 5 1.40 1 1.0
##2 B 8 4 1.75 8 1.5
```