Mahadevan Mahadevan - 1 month ago 20
Scala Question

Not able to create parquet files in hdfs using spark shell

I want to create a Parquet file in HDFS and then read it through Hive as an external table. I'm stuck with stage failures in spark-shell while writing the Parquet files.

Spark Version: 1.5.2
Scala Version: 2.10.4
Java: 1.7


Input file (employee.txt):

1201,satish,25
1202,krishna,28
1203,amith,39
1204,javed,23
1205,prudvi,23

In Spark-Shell:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val employee = sc.textFile("employee.txt")
employee.first()
val schemaString = "id name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType, StructField, StringType};
val schema = StructType(schemaString.split(" ").map(fieldName ⇒ StructField(fieldName, StringType, true)))
val rowRDD = employee.map(_.split(",")).map(e ⇒ Row(e(0).trim.toInt, e(1), e(2).trim.toInt))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
val finalDF = employeeDF.toDF();
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
var WriteParquet= finalDF.write.parquet("/user/myname/schemaParquet")


When I type the last command I get,

ERROR

SPARK APPLICATION MANAGER

I even tried increasing the executor memory, but it's still failing.
Also, importantly, finalDF.show() produces the same error.
So I believe I have made a logical error here.

Thanks for supporting

Answer

The issue is that your schema declares every column as StringType, but the code that builds each Row converts id and age to Int with toInt. The row values then do not match the declared column types, so a MatchError is thrown at runtime, once show() or write actually evaluates the rows.

The data types of columns in the schema should match the data type of values being passed to it. Try the below code.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val employee = sc.textFile("employee.txt")
employee.first()
//val schemaString = "id name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types._;
val schema = StructType(StructField("id", IntegerType, true) :: StructField("name", StringType, true) :: StructField("age", IntegerType, true) :: Nil)
val rowRDD = employee.map(_.split(",")).map(e ⇒ Row(e(0).trim.toInt, e(1), e(2).trim.toInt))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
val finalDF = employeeDF.toDF();
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
var WriteParquet= finalDF.write.parquet("/user/myname/schemaParquet")

This code should run fine.
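
As an alternative to writing the StructType by hand, you could let Spark derive the schema from a case class, so the column types cannot drift from the values. This is only a sketch, assuming spark-shell on Spark 1.5.x (where sc already exists) and the same comma-separated employee.txt. The Employee case class, the blank-line filter, and the output path schemaParquetInferred are names and choices I made up for illustration, not something from the question.

```scala
// Run inside spark-shell (Spark 1.5.x), where `sc` is already defined.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // brings .toDF() into scope for RDDs of case classes

// Hypothetical case class: its field names and types become the DataFrame schema.
case class Employee(id: Int, name: String, age: Int)

val employees = sc.textFile("employee.txt")
  .filter(_.trim.nonEmpty) // assumption: skip blank lines so toInt never sees ""
  .map(_.split(","))
  .map(e => Employee(e(0).trim.toInt, e(1).trim, e(2).trim.toInt))
  .toDF()

// id and age should be reported as integer, name as string.
employees.printSchema()

// Hypothetical output path, chosen so it does not collide with the original one.
employees.write.parquet("/user/myname/schemaParquetInferred")

// Read the files back to confirm the schema survived the round trip.
val readBack = sqlContext.read.parquet("/user/myname/schemaParquetInferred")
readBack.show()
```

If printSchema() shows the types you expect, the same DataFrame can be written with the snappy codec setting from the code above.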