Matt W. Matt W. - 4 years ago 121
R Question

Randomly subset each group to satisfy conditions

Looking to reduce resource allocation by looping through each resource's name, and looking at the assigned accounts to that persons name, selecting one at random and replacing that person's name with NA.

reproducible example:

Accts <- paste0("Acc", 1:200)
Value <- c(500, 2000, 5000, 1000)
AccountDF <- data.frame(Accts, Value)
AccountDF$Owner[1:200] <- NA
AccountDF$Owner[1:23] <- "Jeff"
AccountDF$Owner[24:37] <- "Alex"
AccountDF$Owner[38:61] <- "Steph"
AccountDF$Owner[62:111] <- "Matt"
AccountDF$Owner[112:141] <- "David"

library(dplyr)
OwnerDF <- AccountDF %>%
group_by(Owner) %>%
summarise(Count = n(),
TotalValue = sum(Value)) %>%
filter(!is.na(Owner))


Where I got so far:

for (p in 1:nrow(OwnerDF)){
while (AccountDF$Count[p] > 22){
AccountDF %>%
filter(Owner == OwnerDF$Owner[p]) %>%
sample_n(1)


}
}


I've heard that for loops are unnecessary. I'm sure this can be done with the purr package and pmap or something like that. I am still learning.

I would like to iterate through the OwnerDF and look at whether that person "owns" too many accounts. If yes, look at the original account list and select a random one and replace the owner's name with NA, remove 1 from their count, and continue on.

Lastly after figuring this out I would like to see if it can be done with multiple conditions.. like While(Count > 22 & Value > $40,000), or maybe two while loops. The object is to reduce each person's "owned" accounts to less than a certain threshold and reduce $$ to less than a certain threshold.

Answer Source

To select random accounts, just make a random var and sort on it, taking the first N accounts that meet your conditions:

 set.seed(1)
 res = AccountDF %>% 
   mutate(r = runif(n())) %>% 
   arrange(r) %>% 
   group_by(Owner) %>% 
   mutate(newOwner = replace(Owner, cumsum(Value) > 40000 | row_number() > 22, NA)) %>% 
   select(-r)

# Test that it worked...
res %>%
  filter(!is.na(newOwner)) %>%
  group_by(newOwner) %>%
  summarise(Count = n(), TotalValue = sum(Value))

# A tibble: 5 x 3
#   newOwner Count TotalValue
#      <chr> <int>      <dbl>
# 1     Alex    14      27000
# 2    David    18      37000
# 3     Jeff    18      39500
# 4     Matt    18      39500
# 5    Steph    17      36500

An extension mentioned by the OP in a comment:

Another question for you. Say I have a threshold for each value and count, and if someone has a low count but high value, I want to take a random account from their high value accounts, if they have a high count and low value, I want to take low value accounts away from them. How can I do this from a random perspective?

I'd probably assign a real-valued score to each observation, like...

s = scale(f(x))

where f is some function based on the conditions you mentioned (high count, high value or both), maybe as simple as x when you want to bias towards the low values and -x when you want to bias towards the high values.

Then, add on some noise and sort using the result as above:

r = s + rnorm(length(s))
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download