Mihael Mihael - 1 month ago 7
R Question

Collapse cases in a dataframe only under certain conditions

I want to collapse the dataset below across the Id_no, but only when other variables meet certain conditions, and calculating a special collapse variable.

df <- structure(list(Id_no = structure(c(1L, 1L, 1L, 2L, 2L, 3L), .Label = c("n1",
"n2", "n3"), class = "factor"), Band = structure(c(1L, 2L, 3L,
1L, 2L, 1L), .Label = c("Band 1", "Band 2", "Band 3"), class = "factor"),
median = c(252, 191, 107, 130.5, 61.5, 217), sample_size = c(19L,
20L, 1L, 20L, 12L, 1544L)), .Names = c("Id_no", "Band", "median",
"sample_size"), class = "data.frame", row.names = c(NA, -6L))

> df
Id_no Band median sample_size
1 n1 Band 1 252.0 19
2 n1 Band 2 191.0 20
3 n1 Band 3 107.0 1
4 n2 Band 1 130.5 20
5 n2 Band 2 61.5 12
6 n3 Band 1 217.0 1544


The collapse variable would be the ratio of a median in Band 1 vs mean median between Bands 2 and 3.

This would be calculated for Id_no that actually have entries for both B1 and at least one of B2 or B3.

In addition, the B2 and B3 medians should be considered only if sample size is >=10.

The resulting table I'm looking for is this:

Id_no b1_vs_rest
1 n1 1.32
2 n2 2.12


The first step is probably to remove cases where sample size doesn't meet the minimum:

df <- subset(df, sample_size >=10)

Answer

One approach using dplyr:

library(dplyr)
res <- df %>% group_by(Id_no) %>% 
              filter(sample_size >= 10) %>%
              summarise(b1_vs_rest=median[Band == "Band 1"]/mean(median[Band != "Band 1"])) %>%
              filter(!is.nan(b1_vs_rest))

Notes:

  1. With dplyr, first filter to keep only those rows with sample_size greater than or equal to 10.
  2. Then group_by Id_no and summarise to compute the ratio for each Id_no selecting for the numerator the median for Band == "Band 1" and for the denominator the medians for all other Bands.
  3. If neither "Band 1" nor "Band 2" exists, then mean will return NaN and the result will also be NaN. So, we filter again to keep only non-NaN values.

The result using your data is as expected:

print(res)
### A tibble: 2 x 2
##   Id_no b1_vs_rest
##  <fctr>      <dbl>
##1     n1   1.319372
##2     n2   2.121951