Scala Question

How to troubleshoot java.lang.NumberFormatException: null

I am loading a file that has ~500,000 records, such as this:

ROW_ID, COLOR_CODE, SHADE_ID
21, 22, 321
23, 31, 321


I load it like this:

import org.apache.spark.sql.types._

val colorSchema = StructType(Array(
  StructField("ROW_ID", IntegerType, true),
  StructField("COLOR_CODE", IntegerType, true),
  StructField("SHADE_ID", IntegerType, true)))

def makeSchema(filename: String, tableName: String,
               tableSchema: StructType, uri: String): Unit = {

  val table = spark.read.
    format("com.databricks.spark.csv").
    option("header", "true").
    schema(tableSchema).load(uri + filename).cache()
  table.registerTempTable(tableName.toUpperCase)
}

makeSchema("colors.csv","colors",colorSchema,"s3://bucket/")


The above code runs fine. However, when I run the following query I get this error:
java.lang.NumberFormatException: null


val r = spark.sql("select * from colors where COLOR_CODE = 22").take(1)


What am I doing wrong? And how can I spot this issue in an effective way? I have visually scanned the file to check whether COLOR_CODE has missing values, but I can't see any.

Update

I've asked a separate question that narrows down the problem further: the CSV now has only one row and I still get the same error. See: How to resolve java.lang.NumberFormatException: null in Spark-sql

Answer

You may have null or empty values in your CSV, or other strings that cannot be parsed as an Int.

If the problem is with null values, you can try this:

val table = spark.read.
           format("com.databricks.spark.csv").
           option("header", "true").
           option("nullValue", "null").
           option("treatEmptyValuesAsNulls", "true").
           schema(tableSchema).load(uri+filename).cache()
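
To find out which rows actually trigger the parse failure, one option is to load the file with an all-string schema so nothing is converted during the read, and then filter for COLOR_CODE values that are null or not plain integers. The following is a minimal sketch, assuming a Spark 2.x SparkSession named spark as in the question; stringSchema and badRows are names introduced here for illustration, and the path is the one from the question.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Read every column as a string so nothing is parsed during the load.
val stringSchema = StructType(Array(
  StructField("ROW_ID", StringType, true),
  StructField("COLOR_CODE", StringType, true),
  StructField("SHADE_ID", StringType, true)))

val raw = spark.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  schema(stringSchema).
  load("s3://bucket/colors.csv")

// Keep only the rows whose COLOR_CODE is null or not a plain integer literal
// (surrounding whitespace allowed) -- these are the values IntegerType cannot parse.
val badRows = raw.filter(
  col("COLOR_CODE").isNull || !col("COLOR_CODE").rlike("^\\s*-?\\d+\\s*$"))

badRows.show(20, false)

If badRows comes back non-empty, the offending values (empty strings, literal "null" text, stray whitespace, and so on) show up as-is, which should make it clear whether the nullValue or treatEmptyValuesAsNulls options above are what you need.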