marcin_koss - 4 months ago
Python Question

How to get rows from a DataFrame that contain the value None in PySpark (Spark)

In the example below, the predicate

df.a == 1

returns the correct result, but

df.a == None

returns 0 when it should return 1.

l = [[1], [1], [2], [2], [None]]
df = sc.parallelize(l).toDF(['a'])
df # DataFrame[a: bigint]
df.collect() # [Row(a=1), Row(a=1), Row(a=2), Row(a=2), Row(a=None)]
df.where(df.a == 1).count() # 2L
df.where(df.a == None).count() # 0L


Using Spark 1.3.1

Answer

You can use the Column.isNull method:

df.where(df.a.isNull()).count()
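
The same filter can also be written as a SQL-style string condition, or inverted with isNotNull to keep only the non-null rows. A small sketch, reusing the df from the question:

df.where("a IS NULL").count()       # 1, same result as df.a.isNull()
df.where(df.a.isNotNull()).count()  # 4, the rows where a is not null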

On a side note, this is the behavior one would expect from a normal SQL query. Since NULL marks "missing information and inapplicable information" [1], it doesn't make sense to ask whether something is equal to NULL. It simply either IS or IS NOT missing.
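
You can see this directly: comparing the column to None evaluates to NULL for every row, and WHERE only keeps rows whose predicate is true, so nothing survives. A quick sketch with the df from the question:

df.select((df.a == None).alias("cmp")).collect()
# [Row(cmp=None), Row(cmp=None), Row(cmp=None), Row(cmp=None), Row(cmp=None)]
# the predicate is NULL on every row, which is why where(df.a == None) keeps 0 rows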

The Scala API provides a special null-safe equality operator, <=>, so it is possible to do something like this:

df.where($"a" <=> lit(null))

but it doesn't look like a good idea if you ask me.
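
For completeness: much newer PySpark releases (2.3 and later, so not the 1.3.1 used here) expose the same null-safe equality as Column.eqNullSafe, though isNull is still the clearer choice for this question:

from pyspark.sql.functions import lit
df.where(df.a.eqNullSafe(lit(None))).count()  # 1 on Spark 2.3+; not available in 1.3.1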

[1] Wikipedia, Null (SQL)