ayshelina - 1 year ago 93
R Question

# R - dividing a huge dataframe of latitude/longitude points into groups according to location

I am new to R, but I hear that using `for` loops is a really bad idea. I have working code that uses them, but I would like to improve it because it is extremely slow on big data. I already have a few ideas for improving the algorithm; what I don't know is how to vectorize it, or otherwise do it without `for` loops.

I am simply grouping lat/lng points into circles, with the radius as a parameter.

Example output of the function (it only fills in the values in the `circle_id` column), with the radius set to 100 meters:

``````
[1] "Locations: "
latitude  longitude sensor_time sensor_time2         circle_id
48.15144  17.07569  1447149703  2015-11-10 11:01:43         1
48.15404  17.07452  1447149743  2015-11-10 11:02:23         2
48.15277  17.07514  1447149762  2015-11-10 11:02:42         3
48.15208  17.07538  1447149771  2015-11-10 11:02:51         1
48.15461  17.07560  1447149773  2015-11-10 11:02:53         4
48.15139  17.07562  1447149811  2015-11-10 11:03:31         1
48.15446  17.07517  1447149866  2015-11-10 11:04:26         2
48.15266  17.07330  1447149993  2015-11-10 11:06:33         5
``````

So I have two `for` loops: loop 1 goes through every row, and loop 2 goes through every previously created circle and checks whether the current location from loop 1 is within the radius of one of those existing circles. The centre of each circle is the first location found outside the radius of all previous circles.

Here's the code:

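``````
library(geosphere)  # provides distHaversine()

# Assumption: datfr already has a circle_id column holding 0 for
# unassigned rows; the "< 1" check below relies on that.
init_circles = function(datfr, radius) {
  cnt = 1
  datfr$circle_id[1] = 1
  longitude = datfr$longitude[1]
  latitude = datfr$latitude[1]
  circle_id = datfr$circle_id[1]
  # datfr2 holds the centre of every circle found so far
  datfr2 <- data.frame(longitude, latitude, circle_id)

  for (i in 2:NROW(datfr)) {
    # Join the first existing circle whose centre is within the radius
    for (j in 1:NROW(datfr2)) {
      tmp = distHaversine(c(datfr$longitude[i], datfr$latitude[i]),
                          c(datfr2$longitude[j], datfr2$latitude[j]))
      if (tmp < radius) {
        datfr$circle_id[i] = datfr2$circle_id[j]
        break
      }
    }
    # No circle matched: this point becomes the centre of a new circle
    if (datfr$circle_id[i] < 1) {
      cnt = cnt + 1
      datfr$circle_id[i] = cnt
      datfr2[nrow(datfr2) + 1, ] = c(datfr$longitude[i], datfr$latitude[i], datfr$circle_id[i])
    }
  }
  return(datfr)
}
``````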

`datfr` is the input data frame without circle IDs set, and `datfr2` is a temporary data frame containing the circles that already exist.

EDIT: here is a visual output:

You can see what those circles are used for: the upper red circle has 21 other locations that fit within its radius (21 + 1 original = 22).

Thank you so much for helping,
Alena

I've assumed we have a data frame `circles` with the center and radius of each circle and that the sample data posted in your question is in a data frame called `dat`. The code below vectorizes the calculation of distance and uses `lapply` to calculate the distance of each point from the center of each circle and to determine if each point is inside the radius of that circle.

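``````
library(geosphere)

# We'll check the distance of each data point from the center of each
# of these circles. The radius column (in meters) was cut off in the line
# as posted; 180 and 200 are assumed values that agree with the output below.
circles = data.frame(ID=1:2, lon=c(17.074, 17.076), lat=c(48.1513, 48.15142),
                     radius=c(180, 200))

datNew = lapply(1:nrow(circles), function(i) {

  df = dat

  # Distance in meters from every point to the center of circle i
  df$dist = distHaversine(df[,c("longitude", "latitude")],
                          circles[rep(i,nrow(df)), c('lon','lat')])

  df$in_circle = ifelse(df$dist <= circles[i, "radius"], "Yes", "No")

  df$circle_id = circles[i, "ID"]

  df

})

# Stack the per-circle results into one long data frame
datNew = do.call(rbind, datNew)

datNew
``````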
``````
   latitude longitude sensor_time sensor_time2    time3      dist in_circle circle_id
1  48.15144  17.07569  1447149703   2015-11-10 11:01:43 126.47756       Yes         1
2  48.15404  17.07452  1447149743   2015-11-10 11:02:23 307.45048        No         1
3  48.15277  17.07514  1447149762   2015-11-10 11:02:42 184.24465        No         1
4  48.15208  17.07538  1447149771   2015-11-10 11:02:51 134.32601       Yes         1
5  48.15461  17.07560  1447149773   2015-11-10 11:02:53 387.15358        No         1
6  48.15139  17.07562  1447149811   2015-11-10 11:03:31 120.73138       Yes         1
7  48.15446  17.07517  1447149866   2015-11-10 11:04:26 362.34236        No         1
8  48.15266  17.07330  1447149993   2015-11-10 11:06:33 160.07179       Yes         1
9  48.15144  17.07569  1447149703   2015-11-10 11:01:43  23.13059       Yes         2
10 48.15404  17.07452  1447149743   2015-11-10 11:02:23 311.68096        No         2
11 48.15277  17.07514  1447149762   2015-11-10 11:02:42 163.29068       Yes         2
12 48.15208  17.07538  1447149771   2015-11-10 11:02:51  86.70762       Yes         2
13 48.15461  17.07560  1447149773   2015-11-10 11:02:53 356.34955        No         2
14 48.15139  17.07562  1447149811   2015-11-10 11:03:31  28.41890       Yes         2
15 48.15446  17.07517  1447149866   2015-11-10 11:04:26 343.97933        No         2
16 48.15266  17.07330  1447149993   2015-11-10 11:06:33 243.44024        No         2
``````

So we now have a data frame telling us whether each point is inside a given circle. The data frame is in long format, meaning that there are `n` rows for each point in the original data frame `dat`, where `n` is the number of rows in the `circles` data frame. From here, you can do further processing, such as just keeping one row for each point that's in multiple circles, etc.
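As a rough sketch of the "one row per point" idea, the base R snippet below keeps, for each point, only the closest circle that actually contains it. The small `datNew` built here is a stand-in that copies rows 1, 2, 9 and 10 of the output above, and `nearest` is just an illustrative name.

```r
# Stand-in for datNew: two points, each checked against two circles.
# Values are copied from rows 1, 2, 9 and 10 of the output above.
datNew <- data.frame(
  latitude  = c(48.15144, 48.15404, 48.15144, 48.15404),
  longitude = c(17.07569, 17.07452, 17.07569, 17.07452),
  dist      = c(126.47756, 307.45048, 23.13059, 311.68096),
  in_circle = c("Yes", "No", "Yes", "No"),
  circle_id = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Within each point, put rows that are inside a circle first,
# and among those the smallest distance first
ordered <- datNew[order(datNew$latitude, datNew$longitude,
                        datNew$in_circle != "Yes", datNew$dist), ]

# Keep only the first row for each point
nearest <- ordered[!duplicated(ordered[, c("latitude", "longitude")]), ]

# A point whose best row is still "No" is not inside any circle
nearest$circle_id[nearest$in_circle == "No"] <- NA

nearest
```

Sorting on `in_circle` before `dist` matters when circles have different radii, because the closest centre is not necessarily a circle that contains the point. Like the grouping by latitude and longitude, this treats rows with identical coordinates as the same point.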

Here's an example. We'll return a data frame listing which circles a point is inside of, or return "None" if the point is not inside any circle:

``````
library(dplyr)

datNew %>%
  group_by(latitude, longitude) %>%
  summarise(in_which_circles = if (any(in_circle == "Yes")) {
    paste(circle_id[in_circle == "Yes"], collapse = ",")
  } else {
    "None"
  })
``````
``````
  latitude longitude in_which_circles
     <dbl>     <dbl>            <chr>
1 48.15139  17.07562              1,2
2 48.15144  17.07569              1,2
3 48.15208  17.07538              1,2
4 48.15266  17.07330                1
5 48.15277  17.07514                2
6 48.15404  17.07452             None
7 48.15446  17.07517             None
8 48.15461  17.07560             None
``````
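The same vectorization idea can also be applied to the greedy grouping from the question, where the circle centres are not known in advance. The sketch below keeps the outer loop over points (each point depends on the circles created before it) but replaces the inner loop with a single vectorized distance calculation against all existing centres. The names `haversine_m` and `assign_circles` are made up for this illustration, and `haversine_m` is a base R stand-in for `distHaversine` so that the snippet runs without extra packages.

```r
# Haversine distance in meters from one point to many points.
# Stand-in for geosphere::distHaversine; r is the Earth radius that
# geosphere uses by default.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Greedy grouping: a point joins the first existing circle whose centre is
# closer than `radius`; otherwise it becomes the centre of a new circle.
assign_circles <- function(lon, lat, radius) {
  n <- length(lon)
  circle_id <- integer(n)
  centre_lon <- numeric(n)  # preallocated; only the first n_centres are used
  centre_lat <- numeric(n)
  n_centres <- 0L

  for (i in seq_len(n)) {
    hit <- integer(0)
    if (n_centres > 0L) {
      k <- seq_len(n_centres)
      # One vectorized call instead of a loop over the centres
      d <- haversine_m(lon[i], lat[i], centre_lon[k], centre_lat[k])
      hit <- which(d < radius)
    }
    if (length(hit) > 0L) {
      circle_id[i] <- hit[1]
    } else {
      n_centres <- n_centres + 1L
      centre_lon[n_centres] <- lon[i]
      centre_lat[n_centres] <- lat[i]
      circle_id[i] <- n_centres
    }
  }
  circle_id
}

# Sample points from the question, radius 100 meters
lat <- c(48.15144, 48.15404, 48.15277, 48.15208,
         48.15461, 48.15139, 48.15446, 48.15266)
lon <- c(17.07569, 17.07452, 17.07514, 17.07538,
         17.07560, 17.07562, 17.07517, 17.07330)
ids <- assign_circles(lon, lat, radius = 100)
print(ids)
```

On the sample data this should reproduce the `circle_id` column shown in the question (1, 2, 3, 1, 4, 1, 2, 5). The work is still proportional to the number of points times the number of circles, so this is a speed-up of the inner step rather than a different algorithm; preallocating the centre vectors also avoids growing a data frame row by row.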