jpg05 jpg05 - 4 months ago 8
R Question

Filter within groups where x first exceeds y

I have a database consisting of


  • neighborhood IDs for home neighborhoods (id_h),

  • block IDs for home blocks (
    blk_h
    ,

  • a sub-geography of neighborhoods),

  • work blocks (
    blk_w
    ),

  • the flow of commuters between the two (
    Flow
    ),

  • the median commuter per home neighborhood (
    Med_C
    ), and

  • the cumulative worker flow by home neighborhood (
    CumFlow
    ).



The data is sorted by distance between
blk_h
and
blk_w
(descending), grouped by
id_h
. I need to subset the data to extract the case for each home neighborhood where
CumFlow
FIRST equals or exceeds
Med_C
.

I've tried a variety of dplyr functions and cannot get it to work. Here's an example:

df <- data.frame(
id_h=c("A","A","A","A","B","B","B"),
blk_h=c("A1","A1","A2","A2","B1","B2","B2"),
blk_w=c("W1","W2","W3","W3","W1","W2","W2"),
dist=c(4.3,5.6,7.0,8.7,5.2,6.5,6.8),
Flow=c(3,6,3,7,5,4,2),
CumFlow=c(3,9,12,19,5,9,11),
Med_C=c(10,10,10,10,6,6,6)
)
df


I need for this to return a table like this:

id_h blk_h blk_w dist Flow CumFlow Med_C
A A2 W3 7.0 3 12 10
B B2 W2 6.5 4 9 6


And here are some of the things I've tried to make this happen:
Attempt #1

library(dplyr)
df.g <- group_by(df, id_h)
df.g2 <- filter(df.g, CumFlow == which.min(CumFlow >= Med_C))


Attempt #2

library(data.table)
setDT(df)[, .SD[which.min(CumCount >= Med_C)], by = id_h]


Attempt #3

library(dplyr)
test <- df %>% group_by(id_h) %>% filter(min(CumFlow) >= Med_C)


I think I am misunderstanding how to use the
which.min
function. Any advice is greatly appreciated.

Answer

Two filter calls can solve this.

Using group_by to work within each id_h, the first filter returns a data.frame with all rows where the CumFlow is greater than or equal to the Med_C. The second filter returns, within each id_h, the row with the lowest CumFlow. This only works because the data is sorted. To make the work more robust you may consider adding a call to arrange after the call to group_by.

library(dplyr)

df <- data.frame(
  id_h    = c("A","A","A","A","B","B","B"),
  blk_h   = c("A1","A1","A2","A2","B1","B2","B2"),
  blk_w   = c("W1","W2","W3","W3","W1","W2","W2"),
  dist    = c(4.3,5.6,7.0,8.7,5.2,6.5,6.8),
  Flow    = c(3,6,3,7,5,4,2),
  CumFlow = c(3,9,12,19,5,9,11),
  Med_C   = c(10,10,10,10,6,6,6)
)
df

df %>%
group_by(id_h) %>%
filter(CumFlow >= Med_C) %>%
filter(CumFlow == min(CumFlow))