Merve Bozo - 3 months ago
Scala Question

Count number of NULL values in a row of Dataframe table in Apache Spark using Scala

I want to do some preprocessing on my data, and I want to drop the rows that are sparse (relative to some threshold value).

For example, I have a dataframe with 10 features, and a row with 8 null values; I want to drop that row.

I found some related topics, but none of them contain useful information for my purpose.

Examples like those won't work for me, because I want to do this preprocessing automatically; I cannot write out the column names and handle each one accordingly.

So is there any way to do this delete operation without using the column names in Apache Spark with Scala?

mlk

Test data:

import sqlContext.implicits._

case class Document(a: String, b: String, c: String)
val df = sc.parallelize(Seq(
  Document(null, null, null),
  Document("a", null, null),
  Document("a", "b", null),
  Document("a", "b", "c"),
  Document(null, null, "c"))).toDF

With UDF

Remixing the answer by David and my RDD version below, you can do it with a UDF that takes a whole row (passed in as a struct of all the columns) and counts the nulls in it:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// Keep rows with fewer than 2 nulls, whatever the columns are called.
def nullFilter = udf((x: Row) => Range(0, x.length).count(x.isNullAt(_)) < 2)
df.filter(nullFilter(struct(df.columns.map(col): _*))).show

With RDD

You could turn it into an RDD, loop over the columns in each Row, and count how many are null:

sqlContext.createDataFrame(
  df.rdd.filter(x => Range(0, x.length).count(x.isNullAt(_)) < 2),
  df.schema
).show
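If the goal is simply "keep rows with at least N non-null values", the built-in `DataFrameNaFunctions` can express this without a UDF or an RDD round-trip. A minimal sketch, assuming the `df` defined above and the same rule as the filters here (at most one null per row, i.e. at least `columns.length - 1` non-nulls):

```scala
// Equivalent to the UDF and RDD filters above: drop any row with
// fewer than (number of columns - 1) non-null values. No column
// names needed; the threshold is derived from the schema width.
df.na.drop(minNonNulls = df.columns.length - 1).show
```

This reads the threshold straight off the schema, so it adapts automatically if the number of features changes.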