Rakesh Kumar Rakesh Kumar - 1 month ago 42
R Question

Trying to Connect R to Spark using Sparklyr

I'm trying to connect R to Spark using Sparklyr.

I followed the tutorial from rstudio blog

I tried installing sparklyr using


  • install.packages("sparklyr")
    which went fine but In another post, I saw that there was a bug in sparklyr_0.4 version. So I followed the instruction to download the dev version using

  • devtools::install_github("rstudio/sparklyr")
    which also went fine and now my sparklyr version is sparklyr_0.4.16.



I followed the rstudio tutorial to download and install spark using

spark_install(version = "1.6.2")


When I tried to first connect to spark using

sc <- spark_connect(master = "local")


got the following error.

Created default hadoop bin directory under: C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop
Error:
To run Spark on Windows you need a copy of Hadoop winutils.exe:
1. Download Hadoop winutils.exe from:
https://github.com/steveloughran/winutils/raw/master/hadoop-2.6.0/bin/
2. Copy winutils.exe to C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin
Alternatively, if you are using RStudio you can install the RStudio Preview Release,
which includes an embedded copy of Hadoop winutils.exe:
https://www.rstudio.com/products/rstudio/download/preview/**


I then downloaded winutils.exe and placed it in
C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin
- This was given in instruction.

I tried connecting to spark again.

sc <- spark_connect(master = "local",version = "1.6.2")


but I got the following error

Error in force(code) :
Failed while connecting to sparklyr to port (8880) for sessionid (8982): Gateway in port (8880) did not respond.
Path: C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\spark-submit2.cmd
Parameters: --class, sparklyr.Backend, --packages, "com.databricks:spark-csv_2.11:1.3.0", "C:\Users\rkaku\Documents\R\R-3.2.3\library\sparklyr\java\sparklyr-1.6-2.10.jar", 8880, 8982
Traceback:
shell_connection(master = master, spark_home = spark_home, app_name = app_name, version = version, hadoop_version = hadoop_version, shell_args = shell_args, config = config, service = FALSE, extensions = extensions)
start_shell(master = master, spark_home = spark_home, spark_version = version, app_name = app_name, config = config, jars = spark_config_value(config, "spark.jars.default", list()), packages = spark_config_value(config, "sparklyr.defaultPackages"), extensions = extensions, environment = environment, shell_args = shell_args, service = service)
tryCatch({
gatewayInfo <- spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, config = config, isStarting = TRUE)
}, error = function(e) {
abort_shell(paste("Failed while connecting to sparklyr to port (", gatewayPort, ") for sessionid (", sessionId, "): ", e$message, sep = ""), spark_submit_path, shell_args, output_file, error_file)
})
tryCatchList(expr, classes, parentenv, handlers)
tryCatchOne(expr, names, parentenv, handlers[[1]])
value[[3]](cond)
abort_shell(paste("Failed while connecting to sparklyr to port (", gatewayPort, ") for sessionid (", sessionId, "): ", e$message, sep = ""), spark_submit_path, shell_args, output_file, error_file)

---- Output Log ----
The system cannot find the path specified.


Can somebody please help me to solve this Issue. I'm sitting on this issue from past 2 weeks without much help. Really appreciate anyone who could help me resolve this.

Answer

I finally figured out the issue and am really happy that could do it all by myself. Obviously with lot of googling.

The issue was with Winutils.exe.

R studio does not give the correct location to place the winutils.exe. Copying from my question - location to paste winutils.exe was C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin.

But while googling i figured out that there's a log file that will be created in temp folder to check for the issue, which was as below.

java.io.IOException: Could not locate executable C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\bin\winutils.exe in the Hadoop binaries

Location given in log file was not same as the location suggested by R Studio :) Finally after inserting winutils.exe in the location referred by spark log file, I was able to successfully connect to Sparklyr ...... wohooooooo!!!! I'll have to say 3 weeks of time was gone in just connecting to Spark, but all worth it :)

Comments