Mihael - 1 year ago 75

R Question

I want to collapse the dataset below across the Id_no, but only when other variables meet certain conditions, and calculating a special collapse variable.

`df <- structure(list(Id_no = structure(c(1L, 1L, 1L, 2L, 2L, 3L), .Label = c("n1",`

"n2", "n3"), class = "factor"), Band = structure(c(1L, 2L, 3L,

1L, 2L, 1L), .Label = c("Band 1", "Band 2", "Band 3"), class = "factor"),

median = c(252, 191, 107, 130.5, 61.5, 217), sample_size = c(19L,

20L, 1L, 20L, 12L, 1544L)), .Names = c("Id_no", "Band", "median",

"sample_size"), class = "data.frame", row.names = c(NA, -6L))

> df

Id_no Band median sample_size

1 n1 Band 1 252.0 19

2 n1 Band 2 191.0 20

3 n1 Band 3 107.0 1

4 n2 Band 1 130.5 20

5 n2 Band 2 61.5 12

6 n3 Band 1 217.0 1544

The collapse variable would be the ratio of a median in Band 1 vs mean median between Bands 2 and 3.

This would be calculated for Id_no that actually have entries for both B1 and at least one of B2 or B3.

In addition, the B2 and B3 medians should be considered only if sample size is >=10.

The resulting table I'm looking for is this:

`Id_no b1_vs_rest`

1 n1 1.32

2 n2 2.12

The first step is probably to remove cases where sample size doesn't meet the minimum:

`df <- subset(df, sample_size >=10)`

Recommended for you: Get network issues from **WhatsUp Gold**. **Not end users.**

Answer Source

One approach using `dplyr`

:

```
library(dplyr)
res <- df %>% group_by(Id_no) %>%
filter(sample_size >= 10) %>%
summarise(b1_vs_rest=median[Band == "Band 1"]/mean(median[Band != "Band 1"])) %>%
filter(!is.nan(b1_vs_rest))
```

Notes:

- With
`dplyr`

, first`filter`

to keep only those rows with`sample_size`

greater than or equal to`10`

. - Then
`group_by`

`Id_no`

and`summarise`

to compute the ratio for each`Id_no`

selecting for the numerator the`median`

for`Band == "Band 1"`

and for the denominator the medians for all other Bands. - If neither
`"Band 1"`

nor`"Band 2"`

exists, then`mean`

will return`NaN`

and the result will also be`NaN`

. So, we`filter`

again to keep only non-`NaN`

values.

The result using your data is as expected:

```
print(res)
### A tibble: 2 x 2
## Id_no b1_vs_rest
## <fctr> <dbl>
##1 n1 1.319372
##2 n2 2.121951
```

Recommended from our users: **Dynamic Network Monitoring from WhatsUp Gold from IPSwitch**. ** Free Download**