Apache Spark Question

How to understand the progress bar for take() in spark-shell

I called the take() method on an RDD[LabeledPoint] from spark-shell, which turned out to be a laborious job for Spark.

The spark-shell shows a progress-bar:

[Image: spark-shell progress bar]

The progress bar fills again and again, and I don't know how to produce a reasonable estimate of the remaining time (or total progress) from the numbers it shows.

Does anyone know what those numbers mean?

Thanks in advance.

Answer

The numbers show the ID of the Spark stage that is running, followed by the number of completed, in-progress, and total tasks in that stage. For example, a bar ending in (14 + 5) / 100 means the stage has completed 14 of its 100 tasks and 5 more are currently running. (See What do the numbers on the progress bar mean in spark-shell? for more on the progress bar.)

Spark stages run tasks in parallel. In your case 5 tasks are running in parallel at the moment. If each task takes roughly the same amount of time, the ratio of completed tasks to total tasks should give you an idea of how much longer you have to wait for this stage to finish.
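Assuming, as above, that tasks are roughly uniform and the level of parallelism stays constant, the back-of-the-envelope estimate can be sketched in plain Scala. The numbers below are made up for illustration, not taken from your run:

```scala
// Hypothetical readings from a bar like "(14 + 5) / 100":
val completed  = 14     // tasks finished in this stage
val total      = 100    // total tasks in this stage
val elapsedSec = 70.0   // wall-clock seconds since the stage started

// If tasks are roughly uniform, the remaining time scales with the
// ratio of remaining tasks to completed tasks.
val etaSec = elapsedSec * (total - completed) / completed

println(f"roughly $etaSec%.0f seconds left in this stage")
// prints: roughly 430 seconds left in this stage
```

This is only a rough guide: skewed partitions or changing cluster load will throw the estimate off.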

But RDD.take can run more than one stage. take(1) first fetches the first element of the first partition. If the first partition is empty, it next looks at the first elements of the second, third, fourth, and fifth partitions. The number of partitions examined in each new stage is 4× the number of partitions already checked, so if you have a whole lot of empty partitions, take(1) can go through many iterations. This happens, for example, when you have a large amount of data and then do filter(_.name == "John").take(1).
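The 4× growth described above can be sketched in plain Scala, no Spark needed. This is an illustration of the scaling rule only, not Spark's actual code, which also applies caps and safety factors that have varied between versions:

```scala
// Simulate how many stages take(1) might launch while hunting for a
// non-empty partition, using the 4x scale-up rule described above.
val totalPartitions = 1000
var scanned = 0
var stage   = 0
while (scanned < totalPartitions) {
  // The first stage scans 1 partition; each later stage scans 4x the
  // number already checked (capped at what is left).
  val batch = if (scanned == 0) 1
              else math.min(scanned * 4, totalPartitions - scanned)
  stage += 1
  scanned += batch
  println(s"stage $stage scanned $batch partition(s), $scanned/$totalPartitions so far")
}
// With 1000 empty partitions this worst case runs 6 stages,
// scanning 1, 4, 20, 100, 500, and finally 375 partitions.
```

So even in the worst case the number of extra stages grows only logarithmically with the partition count, but each one shows up as a fresh progress bar in spark-shell.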

If you know your result will be small, you can save time by using collect instead of take(1). collect always gathers all the data in a single stage, and its main advantage here is that all partitions are processed in parallel, instead of in the partly sequential manner of take.