Algina Algina - 9 days ago 5
Python Question

Spark: pyspark crash for some datasets - ubuntu

I am using Ubuntu and a local Spark installation (spark-2.0.2).
My dataset is quiet small, and my code runs in I have a small data.
In case I increase the dataset (txt file) with a few more lines the error occours.

I tried the exact same code on a Cloudera VM, where Hadoop is installed and it works fine.

So, it must be some memory issue or limitation on my Ubuntu machine.

There are some other similar issues like : Apache Spark: pyspark crash for large dataset

but in my case it did not help.
I do not have an Hadoop Cluster, just Spark, python 2.7, and java 1.8.
It works fine, just when there are some more complex calculations or the dataset is bigger it crashes.

Any clue?

The error:


spark-submit myCalc.py


ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 175, in main
process()
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py", line 238, in mergeValues
d[k] = comb(d[k], v) if k in d else creator(v)
File "/home/alg/Documents//Spark/code/customer_orders/myCalc.py", line 24, in <lambda>
reduced_total = RDD_map.reduceByKey(lambda x,y: (x[1]+y[1]))
TypeError: 'float' object has no attribute '__getitem__'

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/12/01 23:25:51 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "/home/alg/Documents//Spark/code/customer_orders/myCalc.py", line 28, in <module>
results = reduced_total.collect()
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 776, in collect
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 175, in main
process()
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py", line 238, in mergeValues
d[k] = comb(d[k], v) if k in d else creator(v)
File "/home/alg/Documents//Spark/code/customer_orders/myCalc.py", line 24, in <lambda>
reduced_total = RDD_map.reduceByKey(lambda x,y: (x[1]+y[1]))
TypeError: 'float' object has no attribute '__getitem__'

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1913)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:912)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:911)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 175, in main
process()
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1792, in combineLocally
File "/home/alg/programs/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py", line 238, in mergeValues
d[k] = comb(d[k], v) if k in d else creator(v)
File "/home/alg/Documents//Spark/code/customer_orders/myCalc.py", line 24, in <lambda>
reduced_total = RDD_map.reduceByKey(lambda x,y: (x[1]+y[1]))
TypeError: 'float' object has no attribute '__getitem__'

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more

Answer

So, it must be some memory issue or limitation on my (...) machine.

It is not. While you did not provide a reproducible example under normal conditions (with sensible __add__ and `getitem' implementations) following function:

lambda x, y: x[1] + y[1]

is not a valid valid choice for reduceByKey. Function you pass to reduceByKey has to be associative and commutative. Obviously it should take arguments of the same type as the return type.

With Python 3.5+ annotations:

from typing import TypeVar

T = TypeVar('T')

def (t1: T, t2: T) -> T:
    return ...

Why function you use doesn't always fail? Because its behavior depends on the data distribution. Let's say you have tuples of shape (string, (string, float)):

will_succeed = sc.parallelize([
  ("a", ("foo", 1.0)), ("a", ("bar", 1.0)),
  ("b", ("foo", 1.0)), ("b", ("bar", 1.0))
], 2)

will_succeed.reduceByKey(lambda x, y: x[1] + y[1]).collect()
[('b', 2.0), ('a', 2.0)]

vs.:

will_fail = sc.parallelize([
  ("a", ("foo", 1.0)), ("a", ("bar", 1.0)), ("a", ("baz", 1.0)),
  ("b", ("foo", 1.0)), ("b", ("bar", 1.0))
], 2)

will_fail.reduceByKey(lambda x, y: x[1] + y[1]).collect()
TypeError: 'float' object is not subscriptable
...

In the first case order of execution for key a will be:

f(("foo", 1.0), ("bar", 1.0))
2.0

where f your function. In the second case it will be equivalent (not necessarily in this order) to:

f(f(("foo", 1.0),  ("bar", 1.0)), ("baz", 1.0))
f(2.0, ("baz", 1.0))
exception!

Correct solution could be:

from operator import itemgetter, add

# will fail no more
will_fail.mapValues(itemgetter(1)).reduceByKey(add) 

It is also possible to use one combineByKey and aggregateByKey:

will_fail.combineByKey(itemgetter(1), lambda x, y: x + y[1], add)