Make42 - 4 months ago
Scala Question

How to println from foreach in Jupyter?

Running

val animals = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin", "spider"), 3)
animals.foreachPartition(x => println(x.mkString(", ") + " are animals"))


in spark-shell prints

lion, gnu, crocodile are animals
cat, dog, tiger are animals
ant, whale, dolphin, spider are animals


but if I run this in Jupyter with the Apache Toree Spark kernel, I get no output. The terminal from which I started Jupyter prints

animals: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[27] at parallelize at <console>:20
16/05/17 09:33:32 [WARN] o.a.t.k.p.v.s.KernelOutputStream - Suppressing empty output: ''


How do I get Jupyter to print the animals the way spark-shell does when using foreach?

Answer

Generally speaking, you don't. Even outside Jupyter, anything printed inside an action or transformation goes to the standard output of whichever process runs the task, so unless you are in local mode it won't show up in your local shell.

If you want to reliably inspect some part of the data, fetch it to the driver and inspect it locally.

animals.take(3).foreach(println)
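
If you want the same per-partition lines as in the question, one option is to build each line on the executors and print only after the results are back on the driver. This is a minimal sketch, assuming `sc` is the SparkContext provided by the kernel and that the RDD is small enough to collect safely; the name `lines` is just for illustration.

```scala
// Assumes `sc` is the SparkContext that the notebook kernel (or spark-shell) provides.
val animals = sc.parallelize(
  List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin", "spider"), 3)

// Build one string per partition on the executors; nothing is printed there.
val lines: Array[String] = animals
  .mapPartitions(iter => Iterator(iter.mkString(", ") + " are animals"))
  .collect() // brings one short string per partition back to the driver

// This println runs on the driver, not inside a task.
lines.foreach(println)
```

Because `collect()` returns results in partition order, the lines come out in a stable order here, unlike the `foreachPartition` version where the order depends on which task finishes first.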

On a side note, I would avoid printing anyway. Unlike logging, it is not easily configurable and can become a serious bottleneck in your code.
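
If you do need a record of what each partition saw, a logger is the more controllable route. The sketch below assumes the SLF4J API is on the classpath (Spark ships with it) and again that `sc` comes from the kernel; the logger name "animals" is made up for this example. The messages end up in the executor logs, not in the notebook, and how they are displayed depends on your logging configuration.

```scala
import org.slf4j.LoggerFactory

// Assumes `sc` is the SparkContext that the notebook kernel (or spark-shell) provides.
val animals = sc.parallelize(
  List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin", "spider"), 3)

animals.foreachPartition { iter =>
  // Look the logger up inside the closure so no logger object has to be serialized.
  val log = LoggerFactory.getLogger("animals") // hypothetical logger name
  log.info(iter.mkString(", ") + " are animals")
}
```

The level and destination of these messages can then be changed in the logging configuration without touching the job code.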