Alexander Alexander - 2 months ago 17
R Question

Things are slow with "dplyr"; is there a faster way?

I'm just trying to calculate the relative angle between each row of my x, y, z data frame and a reference vector. So far, I use dplyr to group things and apply my angle function to get the relative angle. However, things are quite slow, even for the dummy data that I provide here.

set.seed(12345)

x <- replicate(1,c(replicate(1000,rnorm(50,0,0.01))))
y <- replicate(1,c(replicate(1000,rnorm(50,0,0.01))))
z <- replicate(1,c(replicate(1000,rnorm(50,0.9,0.01))))
ref_vector <- data.frame(ref_x=rep(0,100),ref_y=rep(0,100),ref_z=rep(1,100))
set <- rep(seq(1,1000),each=50)

data_rep <- data.frame(x,y,z,ref_vector,set)


head(data_rep)
# x y z ref_x ref_y ref_z set
# 1 0.005855288 -0.015472796 0.9059337 0 0 1 1
# 2 0.007094660 -0.013354359 0.9040137 0 0 1 1
# 3 -0.001093033 -0.014661486 0.9047502 0 0 1 1
# 4 -0.004534972 -0.002764655 0.9070553 0 0 1 1
# 5 0.006058875 -0.008339952 0.8926551 0 0 1 1
# 6 -0.018179560 -0.008412400 0.9055541 0 0 1 1


I define the angle between two vectors with this angle function:

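angle <- function(x, y) {
  # dot product of the two vectors (returned as a 1x1 matrix)
  dot.prod <- x %*% y
  # Euclidean (2-norm) length of each vector
  norm.x <- norm(x, type = "2")
  norm.y <- norm(y, type = "2")
  # angle in radians, converted from a 1x1 matrix to a plain number
  theta <- acos(dot.prod / (norm.x * norm.y))
  as.numeric(theta)
}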
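
For reference, this is the usual dot-product formula for the angle between two vectors, written out here with x and y standing for the function's two arguments:

```latex
\theta = \arccos\!\left( \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert_2 \, \lVert \mathbf{y} \rVert_2} \right)
```

The result is in radians, which is why it is multiplied by 180/pi later on.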


Then let's apply this to our data_rep:


library(dplyr)
system.time(df_angle <- data_rep %>%
              rowwise() %>%
              do(data.frame(., angle_rad = angle(unlist(.[1:3]), unlist(.[4:6])))) %>%
              group_by(set) %>%
              mutate(angle = angle_rad * 180 / pi, mean_angle = mean(angle)))

# user system elapsed
# 64.22 0.08 64.81
# Warning message:
# Grouping rowwise data frame strips rowwise nature


As you can see, the process took around 1 min, and this is not even my real data set, which has 350000 rows and takes 10 min to calculate the relative angle.

I wonder whether there is any way to speed up this process.

Thanks!

Answer

Just use a simple mutate statement instead of your do(data.frame()) part. This improves the performance quite a bit (from roughly 65 s to under 4 s elapsed in the timings shown here), because you no longer have to convert each row into a data.frame.

system.time(df_angle2 <- data_rep%>%
              rowwise() %>% 
              mutate(angle_rad=angle(x = c(x,y,z),y = c(ref_x,ref_y,ref_z))) %>%
              group_by(set)%>%
              mutate(angle=angle_rad*180/pi, mean_angle=mean(angle)))

##      user      system     elapsed 
##      3.72        0.00        3.71

all.equal(df_angle,df_angle2)
##   TRUE
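
If rowwise() is still too slow on the full 350000-row data, one further option might be to drop it entirely and compute the angle with whole-column arithmetic. The sketch below is not part of the timings above and has not been benchmarked here. angle_vec and the toy data frame are hypothetical names introduced only for illustration, and the comparison against the per-row angle function is just a sanity check on ten rows.

```r
# Per-row version, repeated from the question so the snippet runs on its own
angle <- function(x, y) {
  dot.prod <- x %*% y
  norm.x <- norm(x, type = "2")
  norm.y <- norm(y, type = "2")
  as.numeric(acos(dot.prod / (norm.x * norm.y)))
}

# Hypothetical vectorised helper (an assumption, not from the original post):
# each argument is a whole column, so all rows are processed in one call
angle_vec <- function(x, y, z, rx, ry, rz) {
  dot     <- x * rx + y * ry + z * rz      # row-wise dot products
  len     <- sqrt(x^2 + y^2 + z^2)         # row-wise length of (x, y, z)
  ref_len <- sqrt(rx^2 + ry^2 + rz^2)      # row-wise length of the reference
  acos(dot / (len * ref_len))              # angles in radians
}

# Small toy data frame with the same column layout as data_rep
set.seed(1)
toy <- data.frame(x = rnorm(10, 0, 0.01),
                  y = rnorm(10, 0, 0.01),
                  z = rnorm(10, 0.9, 0.01),
                  ref_x = 0, ref_y = 0, ref_z = 1)

# Vectorised result for all rows at once
fast <- with(toy, angle_vec(x, y, z, ref_x, ref_y, ref_z))

# Row-by-row result using the original angle() for comparison
slow <- sapply(seq_len(nrow(toy)), function(i) {
  angle(unlist(toy[i, 1:3]), unlist(toy[i, 4:6]))
})

all.equal(fast, slow)
```

Assuming it agrees on your data as well, it could replace the rowwise() step with a plain mutate(angle_rad = angle_vec(x, y, z, ref_x, ref_y, ref_z)) before the group_by(set).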