Ian Ian - 3 months ago 15
Scala Question

Use implicit value from one module in another in Scala/Spark

I'm trying to get the SQLContext instance from one module in another module. The first module instantiates it as an implicit sqlContext, and I had (erroneously) thought that I could then use an implicit parameter in the second module, but the compiler informs me that:

could not find implicit value for parameter sqlCtxt: org.apache.spark.sql.SQLContext


Here's the skeletal setup I have (I have elided imports and details):

-----
// Application.scala
-----

package apps

object Application extends App {
  val env = new SparkEnvironment("My app", ...)

  try {
    // Call methods from various packages that internally use code from DFExtensions.scala
  }
}

-----
// SparkEnvironment.scala
-----

package common

class SparkEnvironment(val app: String, ...) {
  @transient lazy val conf: SparkConf = new SparkConf().setAppName(app)
  @transient implicit lazy val sc: SparkContext = new SparkContext(conf)
  @transient implicit lazy val sqlContext: SQLContext = new SQLContext(sc)
  ...
}

-----
// DFExtensions.scala
-----
package util

object DFExtensions {

  private def myFun(...)(implicit sqlCtxt: SQLContext) = { ... }

  implicit final class DFExt(val df: DataFrame) extends AnyVal {
    // Extension methods for DataFrame where myFun is supposed to be used -- causes the compile error above!
  }
}


Since it's a multi-project sbt setup, I don't want to pass the env instance around to all related objects, because the code in util is really a shared library. Each sub-project (i.e. app) has its own instance created in its main method.

Because myFun is only called from the implicit class DFExt, I thought about creating an implicit just before each call, à la implicit val sqlCtxt = df.sqlContext. That compiles, but it's kind of ugly, and I would no longer need the implicit in SparkEnvironment.

According to this discussion the implicit sqlContext instance is not in scope, hence compilation fails. I'm not sure a package object would work because the implicit value and parameter are in different packages.

Is what I'm trying to achieve even possible? Is there a better alternative?

The idea is to have several sub-projects in a single project that share the same libraries and core functions. They are typically updated together, so it's nice to have them in a single place. Most of the library functions work directly on data frames and other structures in Spark, but occasionally I need to do something that requires an instance of SparkContext or SQLContext, for instance write a query with sqlContext.sql, as some syntax is not yet natively supported (e.g. flattening with outer lateral views).

Each sub-project has its own main method that creates an implicit instance. Obviously the libraries do not 'know' about this as they are in different packages and I don't pass around the instances. I had thought that somehow implicits are looked for at runtime, so that when an application runs there is an instance of SQLContext defined as an implicit. It's possible that a) it's not in scope because it's in a different package or b) what I'm trying to do is just a bad idea.

Currently there is only one main method because I first have to split the application into multiple components, which I have not done yet.

Just in case it helps:


  • Spark 1.4.1

  • Scala 2.10

  • sbt 0.13.8


Answer

Because myFun is only called from the implicit class DFExt I thought about creating an implicit just before each call à la implicit val sqlCtxt = df.sqlContext and that compiles but it's kind of ugly and I would not need the implicit in SparkEnvironment any longer.

Just put the implicit and myFun inside DFExt:

implicit final class DFExt(val df: DataFrame) extends AnyVal {
  private implicit def sqlCtxt: SQLContext = df.sqlContext

  // no need to take an implicit parameter, as sqlCtxt is already in scope
  private def myFun(...) = ...

  // The extension methods can now use sqlCtxt and/or myFun freely
}
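
To try the pattern without a Spark dependency, here is a minimal sketch. The SQLContext and DataFrame classes below are hypothetical stand-ins for the Spark types (only the sqlContext member and a sql method are mimicked), and runQuery and describe are illustrative names, not part of the original code. It shows that a library helper can keep its implicit parameter while the extension class supplies the value from the wrapped data frame:

```scala
// Hypothetical stand-ins for Spark's SQLContext and DataFrame,
// so that the sketch compiles without Spark on the classpath.
final class SQLContext(val name: String) {
  def sql(query: String): String = s"[$name] $query"
}
final class DataFrame(val sqlContext: SQLContext)

object DFExtensions {
  // Helper that needs a context. The implicit argument is resolved
  // at compile time at each call site, not looked up at runtime.
  private def runQuery(query: String)(implicit sqlCtxt: SQLContext): String =
    sqlCtxt.sql(query)

  implicit final class DFExt(val df: DataFrame) extends AnyVal {
    // A def rather than a val: value classes may not define fields.
    private implicit def sqlCtxt: SQLContext = df.sqlContext

    // Extension method; the implicit def above satisfies runQuery's parameter.
    def describe(table: String): String = runQuery(s"DESCRIBE $table")
  }
}

object Demo extends App {
  import DFExtensions._
  val df = new DataFrame(new SQLContext("ctx"))
  println(df.describe("events")) // prints "[ctx] DESCRIBE events"
}
```

The application never has to hand its environment to the library: the context travels with the data frame itself.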

You could also make sqlCtxt a val, but then: 1) DFExt can't extend AnyVal anymore; 2) it needs to be initialized even if the extension method you call doesn't need it; 3) any calls to sqlCtxt are likely to be inlined, so you are just accessing a val from df instead of this anyway. If they aren't, this means you are using it far too little to matter.
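
For comparison, a sketch of the val variant, again with hypothetical stand-in types instead of the real Spark classes (contextName is an illustrative method, not from the original code). Note that extends AnyVal has to go, because a value class cannot hold a field besides its single constructor parameter:

```scala
// Hypothetical stand-ins for the Spark types.
final class SQLContext(val name: String)
final class DataFrame(val sqlContext: SQLContext)

object DFExtensionsVal {
  // A regular class: a wrapper object is allocated on each implicit conversion.
  implicit final class DFExt(val df: DataFrame) {
    // Initialized on every construction, even if the called method never uses it.
    private implicit val sqlCtxt: SQLContext = df.sqlContext

    def contextName: String = implicitly[SQLContext].name
  }
}
```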
