David Portabella David Portabella - 2 months ago 15
Scala Question

Query a MySQL Database from Spark

I am currently querying a MySQL from Spark as follows:

val sqlContext = new SQLContext(sc)

val dataframe_mysql = sqlContext
.read.format("jdbc").option("driver", "com.mysql.jdbc.Driver")
.option("url", url)
.option("dbtable", "MyTable")
.option("user", "MyUser").option("password", "MyPwd")
.load()

dataframe_mysql.registerTempTable("MyTable")

val lastNames = dataframe_mysql.sqlContext.sql("select lastName from MyTable")


But this loads all the columns from the
MyTable
, and I don't need that (I only need the
lastName
column). (moreover, I need a sql join query, and I don't want to load everything into Spark).

How can I replace the
option("dbtable", "MyTable")
line to specify a SQL query instead of a table?

I have tried querying with
.option("dbtable", "select lastName from MyTable")
(as I read somewhere), but this fails with :

MySQLSyntaxErrorException: select lastName from MyTable WHERE 1=0

Answer

What you have read is correct, you'll need to pass the query with pushdown predicates as an option.

MySQL expects here a aliased table or the table fullname :

// without pushdown predicates
.option("dbtable", "MyTable")

// with pushdown predicates
.option("dbtable", "(select lastName from MyTable) as T")

So you are actually looking to use the later query.

Note: T can be anything here.