ayshelina - 8 days ago
R Question

R - dividing a huge dataframe of latitude/longitude points into groups according to location

I am new to R, but I hear that it is a really bad idea to use for loops. I have working code that uses them, but I would like to improve it because it is extremely slow on big data. I already have a few ideas for improving the algorithm, but what I don't know is how to vectorize it, or how to do it without for loops.

I am simply grouping lat/lng points into circles, with the radius given as a parameter.

Example output of the function (it only fills in the values of the circle_id column), with the radius set to 100 meters:

[1] "Locations: "
latitude longitude sensor_time sensor_time2 circle_id
48.15144 17.07569 1447149703 2015-11-10 11:01:43 1
48.15404 17.07452 1447149743 2015-11-10 11:02:23 2
48.15277 17.07514 1447149762 2015-11-10 11:02:42 3
48.15208 17.07538 1447149771 2015-11-10 11:02:51 1
48.15461 17.07560 1447149773 2015-11-10 11:02:53 4
48.15139 17.07562 1447149811 2015-11-10 11:03:31 1
48.15446 17.07517 1447149866 2015-11-10 11:04:26 2
48.15266 17.07330 1447149993 2015-11-10 11:06:33 5


So I have two for loops: loop1 goes through every row, and loop2 goes through every previously created circle_id and checks whether the current location from loop1 is within the radius of an existing circle. The centre of each circle is the first location found outside the radius of all previous circles.

Here's the code:

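library(geosphere)  # provides distHaversine()

init_circles = function(datfr, radius) {
  # 0 marks a point that has not been assigned to a circle yet
  datfr$circle_id = 0
  cnt = 1
  datfr$circle_id[1] = 1
  longitude = datfr$longitude[1]
  latitude = datfr$latitude[1]
  circle_id = datfr$circle_id[1]
  # datfr2 holds the centre of every circle created so far
  datfr2 <- data.frame(longitude, latitude, circle_id)

  for (i in 2:NROW(datfr)) {
    # compare point i with each existing centre, stop at the first match
    for (j in 1:NROW(datfr2)) {
      tmp = distHaversine(c(datfr$longitude[i], datfr$latitude[i]),
                          c(datfr2$longitude[j], datfr2$latitude[j]))
      if (tmp < radius) {
        datfr$circle_id[i] = datfr2$circle_id[j]
        break
      }
    }
    # no centre was close enough: point i becomes the centre of a new circle
    if (datfr$circle_id[i] < 1) {
      cnt = cnt + 1
      datfr$circle_id[i] = cnt
      datfr2[nrow(datfr2) + 1, ] = c(datfr$longitude[i], datfr$latitude[i], datfr$circle_id[i])
    }
  }
  return(datfr)
}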


datfr is the input dataframe without circle_id's set, and datfr2 is a temporary dataframe containing already existing circles.

EDIT: here is a visual output:

[Image: plot of the resulting circles]

You can see what those circles are used for, the upper red circle has 21 other locations that fit within its radius (21 + 1 original = 22)

Thank you so much for helping,
Alena

Answer

I've assumed we have a data frame circles with the center and radius of each circle and that the sample data posted in your question is in a data frame called dat. The code below vectorizes the calculation of distance and uses lapply to calculate the distance of each point from the center of each circle and to determine if each point is inside the radius of that circle.

library(geosphere)

# We'll check the distance of each data point from the center of each 
#  of these circles
circles = data.frame(ID=1:2, lon=c(17.074, 17.076), lat=c(48.1513, 48.15142), 
                     radius=c(180,190))

datNew = lapply(1:nrow(circles), function(i) {

  df = dat

  df$dist = distHaversine(df[,c("longitude", "latitude")], 
                          circles[rep(i,nrow(df)), c('lon','lat')])

  df$in_circle = ifelse(df$dist <= circles[i, "radius"], "Yes", "No")

  df$circle_id = circles[i, "ID"]

  df

})

datNew = do.call(rbind, datNew)

datNew
   latitude longitude sensor_time sensor_time2    time3      dist in_circle circle_id
1  48.15144  17.07569  1447149703   2015-11-10 11:01:43 126.47756       Yes         1
2  48.15404  17.07452  1447149743   2015-11-10 11:02:23 307.45048        No         1
3  48.15277  17.07514  1447149762   2015-11-10 11:02:42 184.24465        No         1
4  48.15208  17.07538  1447149771   2015-11-10 11:02:51 134.32601       Yes         1
5  48.15461  17.07560  1447149773   2015-11-10 11:02:53 387.15358        No         1
6  48.15139  17.07562  1447149811   2015-11-10 11:03:31 120.73138       Yes         1
7  48.15446  17.07517  1447149866   2015-11-10 11:04:26 362.34236        No         1
8  48.15266  17.07330  1447149993   2015-11-10 11:06:33 160.07179       Yes         1
9  48.15144  17.07569  1447149703   2015-11-10 11:01:43  23.13059       Yes         2
10 48.15404  17.07452  1447149743   2015-11-10 11:02:23 311.68096        No         2
11 48.15277  17.07514  1447149762   2015-11-10 11:02:42 163.29068       Yes         2
12 48.15208  17.07538  1447149771   2015-11-10 11:02:51  86.70762       Yes         2
13 48.15461  17.07560  1447149773   2015-11-10 11:02:53 356.34955        No         2
14 48.15139  17.07562  1447149811   2015-11-10 11:03:31  28.41890       Yes         2
15 48.15446  17.07517  1447149866   2015-11-10 11:04:26 343.97933        No         2
16 48.15266  17.07330  1447149993   2015-11-10 11:06:33 243.44024        No         2

So we now have a data frame telling us whether each point is inside a given circle. The data frame is in long format, meaning that there are n rows for each point in the original data frame dat, where n is the number of rows in the circles data frame. From here, you can do further processing, such as just keeping one row for each point that's in multiple circles, etc.
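As one hedged illustration of that kind of further processing, the sketch below keeps a single row per point: the circle whose center is closest, among the circles the point is actually inside. The dist values are copied from the output above; the names radius, inside and nearest are only illustrative, and the sketch assumes a latitude/longitude pair identifies a point uniquely.

```r
# Long-format result rebuilt from the printed output above
# (only the columns needed for this step).
datNew <- data.frame(
  latitude  = rep(c(48.15144, 48.15404, 48.15277, 48.15208,
                    48.15461, 48.15139, 48.15446, 48.15266), times = 2),
  longitude = rep(c(17.07569, 17.07452, 17.07514, 17.07538,
                    17.07560, 17.07562, 17.07517, 17.07330), times = 2),
  dist      = c(126.47756, 307.45048, 184.24465, 134.32601,
                387.15358, 120.73138, 362.34236, 160.07179,
                 23.13059, 311.68096, 163.29068,  86.70762,
                356.34955,  28.41890, 343.97933, 243.44024),
  circle_id = rep(1:2, each = 8)
)
radius <- c(180, 190)  # radii of circle 1 and circle 2, as in `circles`
datNew$in_circle <- ifelse(datNew$dist <= radius[datNew$circle_id], "Yes", "No")

# Keep only the rows where the point lies inside the circle ...
inside <- datNew[datNew$in_circle == "Yes", ]
# ... sort so that the smallest distance comes first within each point ...
inside <- inside[order(inside$latitude, inside$longitude, inside$dist), ]
# ... and keep the first (nearest) row for each point.
nearest <- inside[!duplicated(inside[, c("latitude", "longitude")]), ]

nearest[, c("latitude", "longitude", "circle_id", "dist")]
```

Points that are not inside any circle are dropped by this filter, so they would need to be added back (for example with a merge against dat) if every point has to appear in the result.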

Here's an example. We'll return a data frame listing which circles a point is inside of, or return "None" if the point is not inside any circle:

library(dplyr)

datNew %>%
  group_by(latitude, longitude) %>% 
  summarise(in_which_circles = if(any(in_circle=="Yes")) paste(circle_id[in_circle=="Yes"], collapse=",") else "None")
  latitude longitude in_which_circles
     <dbl>     <dbl>            <chr>
1 48.15139  17.07562              1,2
2 48.15144  17.07569              1,2
3 48.15208  17.07538              1,2
4 48.15266  17.07330                1
5 48.15277  17.07514                2
6 48.15404  17.07452             None
7 48.15446  17.07517             None
8 48.15461  17.07560             None
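Coming back to the loop in the question: the circles there are not known in advance, because each new center depends on the points seen before it, so the outer loop is hard to remove. The inner loop over existing circles, however, can be replaced by one vectorized distance call. The following is only a sketch under a few assumptions: haversine_m is a hypothetical base-R stand-in for distHaversine (same default Earth radius of 6378137 m) so that the example runs without extra packages, and init_circles_vec is an illustrative name, not part of any library.

```r
# Hypothetical helper: haversine distance in meters from one point to
# a vector of points. geosphere::distHaversine() could be used instead.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

init_circles_vec <- function(datfr, radius) {
  n <- nrow(datfr)
  circle_id <- integer(n)
  # Preallocate for the worst case (every point is its own center)
  # instead of growing a data frame row by row.
  centre_lon <- numeric(n)
  centre_lat <- numeric(n)
  n_centres <- 0L

  for (i in seq_len(n)) {
    hit <- integer(0)
    if (n_centres > 0L) {
      k <- seq_len(n_centres)
      # One vectorized call: distance from point i to all centers so far.
      d <- haversine_m(datfr$longitude[i], datfr$latitude[i],
                       centre_lon[k], centre_lat[k])
      hit <- which(d < radius)
    }
    if (length(hit) > 0L) {
      # First matching circle, same rule as the `break` in the question.
      circle_id[i] <- hit[1]
    } else {
      # No circle is close enough: point i becomes a new center.
      n_centres <- n_centres + 1L
      centre_lon[n_centres] <- datfr$longitude[i]
      centre_lat[n_centres] <- datfr$latitude[i]
      circle_id[i] <- n_centres
    }
  }
  datfr$circle_id <- circle_id
  datfr
}

# Sample points from the question, radius of 100 meters.
dat <- data.frame(
  latitude  = c(48.15144, 48.15404, 48.15277, 48.15208,
                48.15461, 48.15139, 48.15446, 48.15266),
  longitude = c(17.07569, 17.07452, 17.07514, 17.07538,
                17.07560, 17.07562, 17.07517, 17.07330)
)
res <- init_circles_vec(dat, radius = 100)
res$circle_id
```

On the sample data this should give the same circle_id column as the example output in the question. The cost is still one distance call per point, but each call handles all existing centers at once, and the preallocated vectors avoid the repeated data frame growth of the original code.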