tammuz - 12 days ago
Scala Question

How to quickly execute a MySQL query in Spark with Scala

I am trying to use Spark with Scala in Eclipse to obtain data from a MySQL database.
The problem is that the code takes hours just to execute one SQL query.
This is my initial code:

val conf = new SparkConf().setAppName("MyApp").setMaster("local")
val sc = new SparkContext(conf)
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val jdbcUrl = "jdbc:mysql://localhost:3306/myDB?user=us&password=pw"

sqlcontext.jdbc(jdbcUrl, "action").registerTempTable("action")
sqlcontext.jdbc(jdbcUrl, "session").registerTempTable("session")
sqlcontext.cacheTable("action")
sqlcontext.cacheTable("session")


Then, in order to obtain data from the database, I tried several commands:

val data = sqlcontext.sql("SELECT * FROM action INNER JOIN session ON action.session_id = session.session_id")


This takes many hours to finish, so I tried just fetching the table:

val df = sqlcontext.table("action").collect()
println(df.size)


But this did not solve my problem. Finally, I should mention that my action table contains about 11 million rows.

Any Idea?

Answer

There are multiple reasons for a long-running job. Since your master is "local", you are running on a single executor thread. Spark performs better when the data is well partitioned, so check how many partitions are created in your case. If there is only one, repartition the data using repartition(numPartitions: Int) and run with more threads (local[8] or local[*]) to achieve parallel processing.
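The advice above can be sketched roughly as follows, staying with the SQLContext API the question uses. This is a minimal sketch, not a drop-in fix: the partition count 8, and the assumption that session_id is a numeric column spanning roughly 0 to 11,000,000 (used as the JDBC partitioning column), are illustrative and must be adjusted to your data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Use all local cores instead of a single thread ("local").
val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlcontext = new SQLContext(sc)

val jdbcUrl = "jdbc:mysql://localhost:3306/myDB?user=us&password=pw"

// Read the table split into 8 JDBC partitions so rows are fetched in
// parallel. Assumes session_id is numeric and its values lie roughly
// between the given lower and upper bounds (adjust to your data).
val action = sqlcontext.jdbc(jdbcUrl, "action", "session_id", 0L, 11000000L, 8)

// Check how many partitions Spark actually created.
println(action.rdd.partitions.length)

// If a DataFrame still ends up with too few partitions, spread it out
// explicitly before running joins or aggregations on it.
val repartitioned = action.repartition(8)
repartitioned.registerTempTable("action")
```

With a partitioned read, Spark issues one bounded query per partition (WHERE session_id ranges), so the join no longer funnels all 11 million rows through a single connection and thread.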
