M. Rasyid Ridha - 1 month ago
R Question

Calculate and summarize total distance in a table using dplyr in R

I have a table consisting of user, sequence, and geolocation (x and y) columns.

I would like to group it by user and calculate the total distance travelled, following the order given by the sequence.

For example:

> df <- data.frame(user_id=rep(1,3), seq=1:3, x=c(1,5,3), y=c(2,3,9))
> df
  user_id seq x y
1       1   1 1 2
2       1   2 5 3
3       1   3 3 9


Here is the function to calculate distance between two points (Euclidean):

> d <- function(n1,n2){
+ d <- sqrt((df$y[n2]-df$y[n1])^2+(df$x[n2]-df$x[n1])^2)
+ return(d)
+ }


I would like to get the total distance like this:

> df <- data.frame(user_id=1, dtot=d(1,2)+d(2,3))
> df
  user_id  dtot
1       1 10.45


How can I use dplyr "group_by" and get total distance based on the sequence for all users?

Answer

One way to accomplish what you want is to define a function for computing the total distance:

library(dplyr)
total.dist <- function(x,y) {
  sum(sqrt((x-lag(x))^2+(y-lag(y))^2),na.rm=TRUE)
}

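The inputs to this function are the column vectors x and y. Subtracting the lag of each column gives the coordinate differences between consecutive rows, so the distance of every step is computed in a single vectorized expression. The first row has no predecessor, so its distance is NA; na.rm=TRUE drops it when the step distances are summed into the total. Note that this assumes the rows are already ordered by seq within each user.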
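To make the lag step concrete, here is a small sketch that runs the same expression piece by piece on the example coordinates from the question. The intermediate name steps is introduced here for illustration only.

```r
library(dplyr)

# Example coordinates from the question
x <- c(1, 5, 3)
y <- c(2, 3, 9)

# dplyr's lag() shifts a vector down by one position and pads the front with NA
lag(x)   # NA 1 5
lag(y)   # NA 2 3

# Distance from each point to the previous one; the first entry is NA
# because the first point has no predecessor
steps <- sqrt((x - lag(x))^2 + (y - lag(y))^2)
steps    # NA, sqrt(17), sqrt(40)

# Dropping the NA and summing gives the length of the whole path
sum(steps, na.rm = TRUE)
```

The two step lengths are sqrt(17) and sqrt(40), which add up to about 10.44766, the value d(1,2)+d(2,3) in the question.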

Then use it as the summary function after grouping by user_id:

res <- df %>% group_by(user_id) %>% summarise(dtot=total.dist(x,y))
res
## # A tibble: 1 x 2
##   user_id     dtot
##     <dbl>    <dbl>
## 1       1 10.44766
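Since lag() works on row order and never looks at seq, the rows may need sorting first if they are not already in sequence. A minimal sketch with several users, assuming a made-up data frame df2 in which user 1 is the question's example with shuffled rows and user 2 is invented for illustration:

```r
library(dplyr)

total.dist <- function(x, y) {
  sum(sqrt((x - lag(x))^2 + (y - lag(y))^2), na.rm = TRUE)
}

# Hypothetical data: rows are deliberately out of order
df2 <- data.frame(
  user_id = c(1, 2, 1, 2, 1),
  seq     = c(3, 1, 1, 2, 2),
  x       = c(3, 0, 1, 3, 5),
  y       = c(9, 0, 2, 4, 3)
)

res2 <- df2 %>%
  arrange(user_id, seq) %>%   # put each user's points in sequence order
  group_by(user_id) %>%       # lag() then stays within one user's rows
  summarise(dtot = total.dist(x, y))
res2
```

Here user 1 should again come out at about 10.44766, and user 2, who moves from (0,0) to (3,4), at 5. Because summarise() calls total.dist once per group, the lag never crosses from one user into the next.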