M. Rasyid Ridha - 7 months ago

R Question

I have a table consisting of a user, a sequence number, and a geolocation given as x and y.

I would like to group it by user and calculate the total distance travelled, following the sequence order.

For example:

`> df <- data.frame(user_id=rep(1,3), seq=1:3, x=c(1,5,3), y=c(2,3,9))`

`> df`

`  user_id seq x y`

`1       1   1 1 2`

`2       1   2 5 3`

`3       1   3 3 9`

Here is the function to calculate distance between two points (Euclidean):

`> d <- function(n1,n2){`

`+   d <- sqrt((df$y[n2]-df$y[n1])^2+(df$x[n2]-df$x[n1])^2)`

`+   return(d)`

`+ }`

I would like to get the total distance like this:

`> df <- data.frame(user_id=1, dtot=d(1,2)+d(2,3))`

`> df`

`  user_id  dtot`

`1       1 10.45`

How can I use dplyr "group_by" and get total distance based on the sequence for all users?

Answer

One way to accomplish what you want is to define a function for computing the total distance:

```
library(dplyr)
total.dist <- function(x,y) {
  # distance from each row to the previous one; the NA for the first row is dropped
  sum(sqrt((x-lag(x))^2+(y-lag(y))^2),na.rm=TRUE)
}
```

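The inputs to this function are the column vectors `x` and `y`. We compute the distance between each pair of consecutive rows in vectorized fashion by subtracting the `lag` of each column from the column itself. The total distance is then the `sum` of these distances, with the `NA` produced for the first row removed. Note that this assumes the rows of each user are already ordered by `seq`.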
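To make the role of `lag` concrete, here is a small sketch that walks through the intermediate values for the three points from the question. The variable name `step` is only illustrative, and `dplyr::lag` is written out in full to avoid any confusion with `stats::lag`.

```r
library(dplyr)

# Coordinates from the question's example
x <- c(1, 5, 3)
y <- c(2, 3, 9)

# dplyr::lag shifts a vector one position to the right, padding the front with NA
dplyr::lag(x)   # NA 1 5

# Per-step distances; the first entry is NA because row 1 has no predecessor
step <- sqrt((x - dplyr::lag(x))^2 + (y - dplyr::lag(y))^2)
step            # NA, sqrt(17), sqrt(40)

# Dropping the NA and summing gives the total path length
total <- sum(step, na.rm = TRUE)
total           # about 10.44766
```

The two non-missing steps are exactly `d(1,2)` and `d(2,3)` from the question, so the sum matches the expected total.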

Then use it as the `summarise` function after a `group_by` on `user_id`:

```
res <- df %>% group_by(user_id) %>% summarise(dtot=total.dist(x,y))
res
## # A tibble: 1 x 2
##   user_id     dtot
##     <dbl>    <dbl>
## 1       1 10.44766
```
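Since the question asks about all users, here is a hedged sketch of the same approach on more than one user. The data frame `df2` is hypothetical: user 1 reuses the question's points with the rows shuffled, and user 2 is invented for illustration. The `arrange` step is an addition not in the answer above; it is there because `lag` works on row order, so the rows should follow `seq` within each user.

```r
library(dplyr)

total.dist <- function(x, y) {
  sum(sqrt((x - lag(x))^2 + (y - lag(y))^2), na.rm = TRUE)
}

# Hypothetical data: rows are deliberately out of order
df2 <- data.frame(
  user_id = c(1, 2, 1, 2, 1),
  seq     = c(3, 1, 1, 2, 2),
  x       = c(3, 0, 1, 3, 5),
  y       = c(9, 0, 2, 4, 3)
)

res2 <- df2 %>%
  arrange(user_id, seq) %>%            # put each user's rows in sequence order
  group_by(user_id) %>%
  summarise(dtot = total.dist(x, y))   # lag() only sees the rows of one group

res2
```

Here user 1 should come out at about 10.44766 as before, and user 2, who moves from (0, 0) to (3, 4), at 5.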