i.Mik i.Mik - 2 months ago 12
R Question

Avoiding lapply() in R, and finding all elements of Vector B that meet a condition of for each element of Vector A

I have two vectors. For each element of vector A, I would like to know all the elements of vector B that fulfill a certain condition. So, for example, two dataframes containing the vectors:

person <- data.frame(name = c("Albert", "Becca", "Celine", "Dagwood"),
tickets = c(20, 24, 16, 17))
prize <- data.frame(type = c("potato", "lollipop", "yo-yo", "stickyhand",
"moodring", "figurine", "whistle", "saxophone"),
cost = c(6, 11, 13, 17, 21, 23, 25, 30))


For this example, each person in the "person" dataframe has a number of tickets from a carnival game, and each prize in the "prize" dataframe has a cost. But I'm not looking for perfect matches; instead of simply buying a prize, they randomly receive any prize that is within a 5-ticket cost tolerance of what they have.

The output I'm looking for is a dataframe of all the possible prizes each person could win. It would be something like:

person prize
1 Albert stickyhand
2 Albert moodring
3 Albert figurine
4 Albert whistle
5 Becca moodring
6 Becca figurine
... ...


And so on. Right now, I'm doing this with
lapply()
, but this is really no faster than a
for()
loop in R.

library(dplyr)
matching_Function <- function(person, prize, tolerance = 5){
matchlist <- lapply(split(person, list(person$name)),
function(x) filter(prize, abs(x$tickets-cost)<=tolerance)$type)
longlist <- data.frame("person" = rep(names(matchlist),
times = unlist(lapply(matchlist, length))),
"prize" = unname(unlist(matchlist))
)
return(longlist)
}
matching_Function(person, prize)


My actual datasets are much larger (in the hundreds of thousands), and my matching conditions are more complicated (checking coordinates from B to see whether they are within a set radius of coordinates from A), so this is taking forever (several hours).

Are there any smarter ways than
for()
and
lapply()
to solve this?

Answer

An alternative with foverlaps from data.table doing what you wish:

require(data.table)

# Turn the datasets into data.table
setDT(person)
setDT(prize)
# Add the min and max from tolerance
person[,`:=`(start=tickets-tolerance,end=tickets+tolerance)]
# add a dummy column for use as range
prize[,dummy:=cost]
# Key the person table on start and end
setkey(person,start,end)
# As foverlaps to get the corresponding rows from prize into person, filter the NA results and return only the name and type of prize
r<-foverlaps(prize,person,type="within",by.x=c("cost","dummy"))[!is.na(name),list(name=name,prize=type)]
# Re order the result by name instead of prize cost
setorder(r,name)

Output:

       name      prize
 1:  Albert stickyhand
 2:  Albert   moodring
 3:  Albert   figurine
 4:  Albert    whistle
 5:   Becca   moodring
 6:   Becca   figurine
 7:   Becca    whistle
 8:  Celine   lollipop
 9:  Celine      yo-yo
10:  Celine stickyhand
11:  Celine   moodring
12: Dagwood      yo-yo
13: Dagwood stickyhand
14: Dagwood   moodring

I hope I commented enough the code to be self explanatory.


For the second part of the question, using coordinates and testing within a radius.

person <- structure(list(name = c("Albert", "Becca", "Celine", "Dagwood"), 
                         x = c(26, 16, 32, 51), 
                         y = c(92, 51, 25, 4)), 
                    .Names = c("name", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
antenas <- structure(list(name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"), 
                          x = c(40, 25, 38, 17, 58, 19, 34, 38, 67, 26, 46, 17), 
                          y = c(36, 72, 48, 6, 78, 41, 18, 28, 54, 8, 28, 47)), 
                     .Names = c("name", "x", "y"), row.names = c(NA, -12L), class = "data.frame")

setDT(person)
setDT(antenas)
r<-10

results <- person[,{dx=x-antenas$x;dy=y-antenas$y; list(antena=antenas$name[dx^2+dy^2<=r^2])},by=name]

Data.table allow expression in j, so we can do the maths of the outer join for each person against antennas and return only relevant rows with antenna name.

This should not be to much memory consuming as it's done for each row on person and not as a whole.

Maths inspired by this question

This give:

> results
     name antena
1:  Becca      L
2: Celine      G
3: Celine      H