In Spark, I know that errors are recovered by recomputing RDDs from their lineage, unless an RDD is cached; in that case, the computation can restart from the cached RDD.
My question is: how are errors recovered in MapReduce frameworks such as Apache Hadoop? Say a failure occurred in the shuffle phase (that is, after map and before reduce). How would it be recovered? Would the map step be performed again? Is there any stage in MapReduce where output is stored in HDFS, so that computation can restart only from there? And what about a map job that runs after a MapReduce job? Is the output of reduce stored in HDFS?
What you are referring to is classified as a task failure, which could be the failure of either a map task or a reduce task. When a particular task fails, Hadoop allocates another computational resource (a new attempt of the task, typically on another node) to perform the failed map or reduce task.
When it comes to a failure during the shuffle and sort process, it is treated as a failure of the reduce task on that particular node, and the reducer would be started afresh on another resource (note that the reduce phase begins with the shuffle and sort process).
Of course, Hadoop would not reallocate failing tasks indefinitely. Two properties determine how many failed attempts of a task are acceptable:
mapred.map.max.attempts for Map tasks and a property
mapred.reduce.max.attempts for reduce tasks.
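As a sketch, these could be set in a job configuration fragment like the one below (the values shown are the defaults; note that in Hadoop 2+ these properties were renamed to mapreduce.map.maxattempts and mapreduce.reduce.maxattempts):

```xml
<!-- Sketch of a mapred-site.xml / job configuration fragment.
     Values shown are the defaults (4 attempts per task). -->
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.max.attempts</name>
  <value>4</value>
</property>
```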
By default, if any single task fails four times (or however many attempts you configure in those properties), the whole job is considered failed. - Hadoop: The Definitive Guide
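The retry-then-fail policy can be illustrated with a toy sketch (plain Python, not Hadoop code; `run_task` and `flaky` are hypothetical names, and `max_attempts=4` mirrors the default above):

```python
def run_task(task, max_attempts=4):
    """Re-run a task up to max_attempts times, mimicking Hadoop's
    per-task attempt limit; exhausting all attempts fails the job."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except Exception:
            if attempt == max_attempts:
                raise RuntimeError(
                    "task failed %d times; job failed" % max_attempts)

# A flaky task that succeeds on its 3rd attempt:
def flaky(attempt):
    if attempt < 3:
        raise IOError("node lost")
    return "done"

print(run_task(flaky))  # prints "done" (succeeds on attempt 3)
```

A task that fails on every attempt would exhaust all four tries and raise, which stands in for the whole job being marked as failed.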
Since shuffle and sort are part of the reducer, Hadoop would only attempt to rerun the reduce task. Map tasks would not be re-run, as they are considered completed. (The exception is a node failure: map output lives on the mapper's local disk, so if that node is lost, its completed map tasks must be re-run on another node.)
Is there any stage in MapReduce where output is stored in the HDFS, so that computation can restart only from there?
Only the final output is stored in HDFS. Map outputs are classified as intermediate data and are not written to HDFS: HDFS replicates everything it stores, and there is no reason to replicate (and then clean up) intermediate data that is of no use once the job completes. Instead, map output is written to the local disk of the node running the map task.
And what about a Map after Map-Reduce. Is output of reduce stored in HDFS?
The output of the reducer is stored in HDFS, so a subsequent map (or map-only) job can read it from there. For map output, I hope the above description suffices.
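To make the job-boundary point concrete, here is a toy sketch (plain Python, not Hadoop APIs; the dict named `hdfs` and the `run_job` helper are hypothetical stand-ins): each job's reduce output is persisted at the job boundary, so a failure in job 2 would restart only job 2, never job 1's maps.

```python
# Toy model: "hdfs" is a dict standing in for durable, replicated storage.
hdfs = {}

def run_job(input_path, output_path, fn):
    """Run one MapReduce 'job': read durable input, write durable output.
    Intermediate map output is deliberately NOT modeled -- only the
    job-level boundary, where reduce output lands in HDFS."""
    hdfs[output_path] = fn(hdfs[input_path])

hdfs["/input"] = ["a b", "b c"]

# Job 1: word count (map and reduce collapsed for brevity).
def word_count(lines):
    counts = {}
    for line in lines:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

run_job("/input", "/stage1", word_count)

# Job 2 (a "map after MapReduce") reads job 1's output from durable
# storage; if job 2 fails, job 1 is not re-run -- /stage1 persists.
run_job("/stage1", "/output",
        lambda counts: {w: c for w, c in counts.items() if c > 1})

print(hdfs["/output"])  # prints {'b': 2}
```

The design point this sketches: HDFS persistence at job boundaries is what lets a multi-job pipeline resume from the last completed job, analogous to restarting from a cached RDD in Spark.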