Shreya Kaushik - 1 year ago
Scala Question

Pulling 400 million rows using Spark SQL using Jupyter notebook

I am new to Spark and have been trying to execute a Spark SQL query whose result set has close to 400 million rows. I am executing the Spark SQL from a Jupyter notebook, using Spark on Azure HDInsight. Following are the configurations of the Spark cluster (a sketch of how these map to Spark settings follows the list):


  1. Number Of Cores per Executor - 3

  2. Number of Executors - 5

  3. Executor Memory - 4098 MB


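For reference, here is a minimal sketch of how settings like these would map to a Spark configuration. This is only illustrative: on HDInsight with Jupyter the session is typically pre-configured at the cluster or kernel level, and the app name below is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: equivalent SparkSession configuration for the values listed above.
// On HDInsight these are usually set via cluster defaults or the notebook kernel.
val spark = SparkSession.builder()
  .appName("LargeResultQuery")                // placeholder app name
  .config("spark.executor.instances", "5")    // Number of Executors
  .config("spark.executor.cores", "3")        // Cores per Executor
  .config("spark.executor.memory", "4098m")   // Executor Memory
  .getOrCreate()
```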

As long as I execute the query to fetch only the first couple of rows, everything works fine. But the moment I try to pull out all the rows, i.e. all 400 million, it throws an error along the lines of "Executor has killed the request".
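The difference between the two cases is likely where the rows end up. A limited fetch only brings a handful of rows back to the notebook, while pulling the full result set tries to materialize all 400 million rows in the driver process. A hedged sketch of the two patterns, assuming a hypothetical table name `events`:

```scala
// Sketch, assuming a hypothetical table name "events".
val df = spark.sql("SELECT * FROM events")

// Fetching a few rows works: only those rows are brought back to the driver.
df.limit(10).show()

// Pulling everything materializes ~400 million rows in driver memory,
// which is far beyond the memory configured above and is the kind of
// request that gets killed.
// val allRows = df.collect()   // likely to fail at this scale
```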

The first thing I would like to know is whether it is possible to pull this volume of data from Spark SQL through a Jupyter notebook at all.

If it is indeed possible to pull this volume, then what is it that I am doing incorrectly?

Currently, I don't have the exact error message; I will update this post with it shortly.

But it would be a great help if anyone could advise on this.

Thanks!

Answer Source

Thanks! It helped us arrive at a conclusion regarding the use of Spark for our solution.
