LostInOverflow - 1 month ago
R Question

Using R in Apache Spark

There are a few options for accessing R libraries from Spark:

- SparkR
- rpy2 (via pyspark)
- rscala
- OpenCPU

It looks like SparkR is quite limited, OpenCPU requires keeping an additional service running, and the bindings can have stability issues. Is there something else specific to the Spark architecture that makes using any of these solutions difficult?

Do you have any experience with integrating R and Spark you can share?


The main language for the project seems like an important factor.

If pyspark is a good way to use Spark for you (meaning that you are accessing Spark from Python), accessing R through rpy2 should not be much different from using any other Python library with a C extension.

There are reports of users doing so (although with occasional questions, such as "How can I partition pyspark RDDs holding R functions" or "Can I connect an external (R) process to each pyspark worker during setup").
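A minimal sketch of the pattern those questions converge on: initialize rpy2's embedded R once per partition with mapPartitions, rather than once per element. The rpy2 calls and the function names below are illustrative assumptions (they require pyspark and R/rpy2 to be installed); the small local helper at the bottom stands in for Spark so the partitioning flow can be checked without a cluster.

```python
# Sketch: calling R from pyspark workers via rpy2, paying the R start-up
# cost once per partition instead of once per element.
# Assumes pyspark and rpy2 are installed; names here are illustrative.

def r_mean_per_partition(rows):
    """Intended for rdd.mapPartitions: rows is an iterator of floats."""
    # Importing rpy2 inside the function means the embedded R interpreter
    # is initialized in the worker process, not on the driver.
    from rpy2 import robjects
    values = list(rows)
    if values:
        r_mean = robjects.r["mean"]
        yield float(r_mean(robjects.FloatVector(values))[0])

# With a live SparkContext `sc`, the call would look like:
#   sc.parallelize(data, numSlices=4).mapPartitions(r_mean_per_partition).collect()

# A tiny local stand-in for mapPartitions, so the per-partition logic can be
# exercised without a cluster (and without R, using a plain-Python mean):
def simulate_map_partitions(func, partitions):
    out = []
    for part in partitions:
        out.extend(func(iter(part)))
    return out

def py_mean_per_partition(rows):
    values = list(rows)
    if values:
        yield sum(values) / len(values)

partitions = [[1.0, 2.0, 3.0], [10.0, 20.0]]
print(simulate_map_partitions(py_mean_per_partition, partitions))  # [2.0, 15.0]
```

The key design point is that anything expensive to set up (an embedded R, an external process) should be amortized over a whole partition, which is exactly what mapPartitions gives you.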

If R is your main language, helping the SparkR authors with feedback or contributions where you feel there are limitations would be the way to go.

If your main language is Scala, rscala should be your first try.

While the combo pyspark + rpy2 would seem the most "established" (as in "uses the oldest and probably most-tried codebase"), that does not necessarily make it the best solution (and young packages can evolve quickly). I'd first assess which language is preferred for the project and try options from there.