Finkelson - 4 months ago
JSON Question

Spark UDF returns the length of the field name instead of the length of its value

Consider the code below:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkUDFApp {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SparkUDFApp"))
    val ctx = new SQLContext(sc)

    val df = ctx.read.json(".../example.json")
    df.registerTempTable("example")

    // Function backing the UDF: length of its string argument
    // (the intended "% 10" is left commented out, as in the original)
    val fn = (_: String).length // % 10
    ctx.udf.register("len10", fn)

    val res0 = ctx.sql("SELECT len10('id') FROM example LIMIT 1").map(_.getInt(0)).collect()

    println(res0.head)
  }
}


JSON example:

{"id":529799371026485248,"text":"Example"}


The code should return the length of the field's value from the JSON (e.g. the value of 'id', 529799371026485248, is 18 digits long). But instead of returning 18 it returns 2, which I suppose is the length of the field name 'id'.

So my question is how to rewrite the UDF call to fix it?

Answer

The problem is that you are passing the string 'id' to your UDF as a literal, so it is interpreted as the two-character string "id" rather than as a column reference (which is why 2 is returned). To fix this, change how you formulate the SQL query.
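To see why 2 comes back, the registered function can be applied directly in plain Scala, outside Spark: once to the literal string "id" (what the faulty query effectively does) and once to the value rendered as a string (what the corrected query does).

```scala
// The same function that was registered as the "len10" UDF
val fn = (_: String).length

// Length of the literal string "id" -- what len10('id') computes
println(fn("id"))                  // 2

// Length of the value as a string -- what len10(id) computes per row
println(fn("529799371026485248"))  // 18
```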

E.g.

val res0 = ctx.sql("SELECT len10(id) FROM example LIMIT 1").map(_.getInt(0)).collect()

// Or alternatively, use the DataFrame API directly
// (requires: import org.apache.spark.sql.functions.udf)
val len10 = udf((word: String) => word.length)
df.select(len10(df("id")).as("length")).show()