
Avoiding lapply() in R, and finding all elements of Vector B that meet a condition of for each element of Vector A

I have two vectors. For each element of vector A, I would like to know all the elements of vector B that fulfill a certain condition. So, for example, two dataframes containing the vectors:

person <- data.frame(name = c("Albert", "Becca", "Celine", "Dagwood"),
                     tickets = c(20, 24, 16, 17))
prize <- data.frame(type = c("potato", "lollipop", "yo-yo", "stickyhand",
                             "moodring", "figurine", "whistle", "saxophone"),
                    cost = c(6, 11, 13, 17, 21, 23, 25, 30))

For this example, each person in the "person" dataframe has a number of tickets from a carnival game, and each prize in the "prize" dataframe has a cost. But I'm not looking for perfect matches; instead of simply buying a prize, they randomly receive any prize that is within a 5-ticket cost tolerance of what they have.
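
For any one person, the rule is just a vectorised comparison; for example, with Albert's 20 tickets (using the data frames above):

# prizes within 5 tickets of Albert's 20: stickyhand, moodring, figurine, whistle
prize$type[abs(20 - prize$cost) <= 5]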

The output I'm looking for is a dataframe of all the possible prizes each person could win. It would be something like:

  person      prize
1 Albert stickyhand
2 Albert   moodring
3 Albert   figurine
4 Albert    whistle
5  Becca   moodring
6  Becca   figurine
     ...        ...

And so on. Right now, I'm doing this with lapply(), but this is really no faster than a for loop in R.

library(dplyr)

matching_Function <- function(person, prize, tolerance = 5){
  matchlist <- lapply(split(person, list(person$name)),
                      function(x) filter(prize, abs(x$tickets - cost) <= tolerance)$type)
  longlist <- data.frame("person" = rep(names(matchlist),
                                        times = unlist(lapply(matchlist, length))),
                         "prize" = unname(unlist(matchlist)))
  longlist
}
matching_Function(person, prize)

My actual datasets are much larger (in the hundreds of thousands), and my matching conditions are more complicated (checking coordinates from B to see whether they are within a set radius of coordinates from A), so this is taking forever (several hours).

Are there any smarter ways than lapply() to solve this?


An alternative with foverlaps from data.table that does what you wish:


# Turn the datasets into data.tables
# Add the min and max from the tolerance to the person table
# Add a dummy column to the prize table for use as a range
# Key the person table on start and end
# Use foverlaps to get the corresponding rows from prize into person,
#   filter out the NA results and return only the name and the type of prize
# Reorder the result by name instead of by prize cost
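
A minimal sketch of those steps, assuming a tolerance of 5 and the person/prize data frames from the question (the exact call in the original may differ, e.g. in which table gets keyed):

library(data.table)

tolerance <- 5
setDT(person)
setDT(prize)

# person gets the [tickets - tolerance, tickets + tolerance] range
person[, `:=`(start = tickets - tolerance, end = tickets + tolerance)]
# prize gets a zero-width range so its cost can be matched against those intervals
prize[, `:=`(start = cost, end = cost)]

# foverlaps() needs its second (y) table keyed on the interval columns
setkey(person, start, end)

# overlap join: each prize picks up every person whose range covers its cost;
# drop the non-matches and keep only the name and the prize type
res <- foverlaps(prize, person)[!is.na(name), .(name, prize = type)]

# reorder by name instead of by prize cost
setorder(res, name)
res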


       name      prize
 1:  Albert stickyhand
 2:  Albert   moodring
 3:  Albert   figurine
 4:  Albert    whistle
 5:   Becca   moodring
 6:   Becca   figurine
 7:   Becca    whistle
 8:  Celine   lollipop
 9:  Celine      yo-yo
10:  Celine stickyhand
11:  Celine   moodring
12: Dagwood      yo-yo
13: Dagwood stickyhand
14: Dagwood   moodring

I hope I commented the code enough for it to be self-explanatory.

For the second part of the question, using coordinates and testing whether they fall within a radius:

person <- structure(list(name = c("Albert", "Becca", "Celine", "Dagwood"), 
                         x = c(26, 16, 32, 51), 
                         y = c(92, 51, 25, 4)), 
                    .Names = c("name", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
antenas <- structure(list(name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"), 
                          x = c(40, 25, 38, 17, 58, 19, 34, 38, 67, 26, 46, 17), 
                          y = c(36, 72, 48, 6, 78, 41, 18, 28, 54, 8, 28, 47)), 
                     .Names = c("name", "x", "y"), row.names = c(NA, -12L), class = "data.frame")


library(data.table)
setDT(person)   # the grouped j-expression below needs a data.table
r <- 10         # search radius: not given in the original, but 10 reproduces the result below

results <- person[, {dx <- x - antenas$x; dy <- y - antenas$y
                     list(antena = antenas$name[dx^2 + dy^2 <= r^2])}, by = name]

data.table allows expressions in j, so we can do the maths of the outer join for each person against the antennas and return only the relevant rows with the antenna name.

This should not consume too much memory, as the work is done for each row of person rather than for the whole cross join at once.

Maths inspired by this question

This gives:

> results
     name antena
1:  Becca      L
2: Celine      G
3: Celine      H
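
For comparison with the memory point above, a sketch of the naive alternative that materialises every person/antenna pair before filtering (illustrative only, in base R, reusing r from above):

# all nrow(person) * nrow(antenas) index pairs are built up front
idx <- expand.grid(p = seq_len(nrow(person)), a = seq_len(nrow(antenas)))
ok  <- (person$x[idx$p] - antenas$x[idx$a])^2 +
       (person$y[idx$p] - antenas$y[idx$a])^2 <= r^2
data.frame(name = person$name[idx$p][ok], antena = antenas$name[idx$a][ok])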