I am processing large amounts of data using Hadoop MapReduce. The problem is that, occasionally, a corrupt file causes a map task to throw a Java heap space error or something similar.
It would be nice, if possible, to just discard whatever that map task was doing, kill it, and move on with the job, never mind the lost data. I don't want the whole M/R job to fail because of that.
Is this possible in Hadoop, and if so, how?
You can modify the mapreduce.map.failures.maxpercent parameter (named mapred.max.map.failures.percent in older releases). The default value is 0. Increasing this parameter allows up to that percentage of map tasks to fail without failing the job.
You can set this parameter in mapred-site.xml (it will then apply to all jobs), or on a job-by-job basis (probably safer).
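As a rough sketch, the cluster-wide variant in mapred-site.xml might look like the fragment below. It assumes the property is named mapreduce.map.failures.maxpercent, as in current Hadoop releases (older releases use mapred.max.map.failures.percent), and the value 5 is only an illustrative choice, not a recommendation.

```xml
<!-- mapred-site.xml: cluster-wide setting, applies to every job -->
<configuration>
  <property>
    <!-- Older releases call this mapred.max.map.failures.percent -->
    <name>mapreduce.map.failures.maxpercent</name>
    <!-- Illustrative value: tolerate up to 5% failed map tasks -->
    <value>5</value>
  </property>
</configuration>
```

For the job-by-job route, the same property can be set on that job's configuration instead, for example with JobConf.setMaxMapTaskFailuresPercent(5) in the old API, or with -D mapreduce.map.failures.maxpercent=5 on the command line if your driver goes through ToolRunner. Keep in mind that a map task is normally retried several times before it counts as failed, so a corrupt input will still cost you those attempts.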