Make42 - 2 months ago
Scala Question

How to println from foreach in Jupyter?


val animals = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin", "spider"), 3)
animals.foreachPartition(x => println(x.mkString(", ") + " are animals"))

in spark-shell returns

lion, gnu, crocodile are animals
cat, dog, tiger are animals
ant, whale, dolphin, spider are animals

but if I run this in Jupyter with the Apache Toree Spark kernel, I get no output. The terminal from which I started Jupyter prints

animals: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[27] at parallelize at <console>:20
16/05/17 09:33:32 [WARN] o.a.t.k.p.v.s.KernelOutputStream - Suppressing empty output: ''

How do I get Jupyter to print the animals the way spark-shell does, using foreach?


Generally speaking, you don't. Even outside Jupyter, any output produced inside an action or a transformation is written to the standard output of whichever executor runs the task. Unless you run in local mode, that is not your local shell.

If you want to reliably inspect some part of the data, fetch it to the driver and inspect it locally.
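
As a minimal sketch of fetching to the driver, assuming `sc` is the SparkContext that spark-shell and Toree predefine and that the data is small enough to fit in driver memory, `glom()` keeps the per-partition grouping from the question and `collect()` brings it back so that `println` runs on the driver:

```scala
// Assumes `sc` is the predefined SparkContext of spark-shell / Toree.
val animals = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin", "spider"), 3)

// glom() turns each partition into one Array; collect() ships those arrays
// to the driver, ordered by partition index.
val partitions: Array[Array[String]] = animals.glom().collect()

// This println runs on the driver, not inside a Spark task.
partitions.foreach(p => println(p.mkString(", ") + " are animals"))
```

For a large RDD, `animals.take(5).foreach(println)` inspects only a few elements instead of collecting everything.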


On a side note, I would avoid printing anyway. Unlike logging, it is not easily configurable and can become a serious bottleneck in your code.
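
If per-partition messages are still wanted, a logger is one alternative. The following is only a sketch: it assumes `animals` is the RDD from the question and that SLF4J is on the classpath, as it normally is with Spark, and the logger name `"animals"` is made up for illustration. The messages go to the executor logs, not to the notebook.

```scala
import org.slf4j.LoggerFactory

// Assumes `animals` is the RDD defined in the question.
animals.foreachPartition { partition =>
  // Look the logger up inside the closure so it is created on the executor
  // rather than serialized from the driver. "animals" is a hypothetical name.
  val log = LoggerFactory.getLogger("animals")
  log.info(partition.mkString(", ") + " are animals")
}
```

The log level and destination can then be changed through the logging configuration without touching the code.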