devopslife - 1 month ago
Java Question

avro error on AWS EMR

I'm using spark-redshift (https://github.com/databricks/spark-redshift) which uses avro for transfer.

Reading from Redshift works, but when writing I get:

Caused by: java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter


I tried Amazon EMR 4.1.0 (Spark 1.5.0) and 4.0.0 (Spark 1.4.1). I cannot do

import org.apache.avro.generic.GenericData.createDatumWriter


either, only

import org.apache.avro.generic.GenericData


I'm using the Scala shell. I tried downloading several other avro-mapred and avro jars, and tried setting

{"classification":"mapred-site","properties":{"mapreduce.job.user.classpath.first":"true"}},
{"classification":"spark-env","properties":{"spark.executor.userClassPathFirst":"true","spark.driver.userClassPathFirst":"true"}}


and adding those jars to the Spark classpath. Possibly Hadoop (EMR) needs tuning somehow as well.
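For context, a typical way to put user-supplied avro jars ahead of EMR's bundled copies is a launch like the following sketch; the jar paths and the avro version are placeholders, not the exact ones that were tried:

```shell
# Hypothetical jar locations; substitute the avro version spark-redshift expects.
AVRO_JARS=/home/hadoop/avro-1.7.7.jar,/home/hadoop/avro-mapred-1.7.7-hadoop2.jar

# Ship the jars to driver and executors and ask Spark to prefer them
# over the cluster's classpath (both flags exist in Spark 1.4/1.5,
# marked experimental there):
#   spark-shell --jars "$AVRO_JARS" \
#     --conf spark.driver.userClassPathFirst=true \
#     --conf spark.executor.userClassPathFirst=true
```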

Does this ring a bell to anyone?
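For reference, those classifications correspond to a JSON configurations file passed at cluster creation. One caveat: `spark.*.userClassPathFirst` are Spark properties, so the `spark-defaults` classification (which feeds `spark-defaults.conf`) is likely the right home for them rather than `spark-env` — a sketch, with the other `create-cluster` flags elided:

```shell
# Write the classifications in the form `aws emr create-cluster` accepts.
# Note: spark-defaults, not spark-env, is the classification that ends up
# in spark-defaults.conf, where spark.*.userClassPathFirst belongs.
cat > /tmp/emr-classpath.json <<'EOF'
[
  {"Classification": "mapred-site",
   "Properties": {"mapreduce.job.user.classpath.first": "true"}},
  {"Classification": "spark-defaults",
   "Properties": {"spark.executor.userClassPathFirst": "true",
                  "spark.driver.userClassPathFirst": "true"}}
]
EOF

# Hypothetical invocation (remaining required flags elided):
# aws emr create-cluster --release-label emr-4.1.0 \
#   --applications Name=Spark \
#   --configurations file:///tmp/emr-classpath.json ...
```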

Answer

For reference, a workaround by Alex Nastetsky:

Delete the avro jars from the master node:

find / -name "*avro*jar" 2> /dev/null -print0 | xargs -0 -I file sudo rm file

Delete the avro jars from the slave nodes:

yarn node -list | sed 's/ .*//g' | tail -n +3 | sed 's/:.*//g' | xargs -I node ssh node 'find / -name "*avro*jar" 2> /dev/null -print0 | xargs -0 -I file sudo rm file'
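After the cleanup (or instead of it), it helps to confirm which avro jars actually contain the missing method, using `javap` from the JDK. A small sketch — `check_avro_jars` is a helper name invented here, and the EMR paths in the example are typical but not guaranteed:

```shell
# check_avro_jars DIR: list avro jars under DIR and flag whether each
# one's GenericData class declares createDatumWriter (the method the
# NoSuchMethodError says is missing from whichever version got loaded).
check_avro_jars() {
  find "$1" -name "avro*.jar" 2>/dev/null | while read -r jar; do
    if javap -classpath "$jar" org.apache.avro.generic.GenericData 2>/dev/null \
        | grep -q createDatumWriter; then
      echo "OK      $jar"
    else
      echo "MISSING $jar"
    fi
  done
}

# Example (typical EMR install locations, adjust as needed):
# check_avro_jars /usr/lib/hadoop
# check_avro_jars /usr/lib/spark
```

Any jar reported as MISSING that sits earlier on the classpath than your good avro jar is a candidate culprit.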

Setting the configs correctly, as proposed by Jonathan, is worth a shot too.