jpg05 - 1 year ago 72
R Question

# Filter within groups where x first exceeds y

I have a database consisting of

• neighborhood IDs for home neighborhoods (id_h),

• block IDs for home blocks (
`blk_h`
,

• a sub-geography of neighborhoods),

• work blocks (
`blk_w`
),

• the flow of commuters between the two (
`Flow`
),

• the median commuter per home neighborhood (
`Med_C`
), and

• the cumulative worker flow by home neighborhood (
`CumFlow`
).

The data is sorted by distance between
`blk_h`
and
`blk_w`
(descending), grouped by
`id_h`
. I need to subset the data to extract the case for each home neighborhood where
`CumFlow`
FIRST equals or exceeds
`Med_C`
.

I've tried a variety of dplyr functions and cannot get it to work. Here's an example:

``````df <- data.frame(
id_h=c("A","A","A","A","B","B","B"),
blk_h=c("A1","A1","A2","A2","B1","B2","B2"),
blk_w=c("W1","W2","W3","W3","W1","W2","W2"),
dist=c(4.3,5.6,7.0,8.7,5.2,6.5,6.8),
Flow=c(3,6,3,7,5,4,2),
CumFlow=c(3,9,12,19,5,9,11),
Med_C=c(10,10,10,10,6,6,6)
)
df
``````

I need for this to return a table like this:

``````id_h  blk_h  blk_w  dist  Flow  CumFlow  Med_C
A     A2     W3     7.0   3     12       10
B     B2     W2     6.5   4     9        6
``````

And here are some of the things I've tried to make this happen:
Attempt #1

``````library(dplyr)
df.g <- group_by(df, id_h)
df.g2 <- filter(df.g, CumFlow == which.min(CumFlow >= Med_C))
``````

Attempt #2

``````library(data.table)
setDT(df)[, .SD[which.min(CumCount >= Med_C)], by = id_h]
``````

Attempt #3

``````library(dplyr)
test <- df %>% group_by(id_h) %>% filter(min(CumFlow) >= Med_C)
``````

I think I am misunderstanding how to use the
`which.min`
function. Any advice is greatly appreciated.

Two `filter` calls can solve this.

Using `group_by` to work within each `id_h`, the first `filter` returns a `data.frame` with all rows where the `CumFlow` is greater than or equal to the `Med_C`. The second `filter` returns, within each `id_h`, the row with the lowest `CumFlow`. This only works because the data is sorted. To make the work more robust you may consider adding a call to `arrange` after the call to `group_by`.

``````library(dplyr)

df <- data.frame(
id_h    = c("A","A","A","A","B","B","B"),
blk_h   = c("A1","A1","A2","A2","B1","B2","B2"),
blk_w   = c("W1","W2","W3","W3","W1","W2","W2"),
dist    = c(4.3,5.6,7.0,8.7,5.2,6.5,6.8),
Flow    = c(3,6,3,7,5,4,2),
CumFlow = c(3,9,12,19,5,9,11),
Med_C   = c(10,10,10,10,6,6,6)
)
df

df %>%
group_by(id_h) %>%
filter(CumFlow >= Med_C) %>%
filter(CumFlow == min(CumFlow))
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download