Narendra Prasad - 3 years ago
Scala Question

Case-insensitive search in an array-type column of a Spark DataFrame

I have a Spark DataFrame like the following:

+----------+-------------------------------------------------+
|col1 |words |
+----------+-------------------------------------------------+
|An |[An, attractive, ,, thin, low, profile] |
|attractive|[An, attractive, ,, thin, low, profile] |
|, |[An, attractive, ,, thin, low, profile] |
|thin |[An, attractive, ,, thin, low, profile] |
|rail |[An, attractive, ,, thin, low, profile] |
|profile |[An, attractive, ,, thin, low, profile] |
|Lighter |[Lighter, than, metal, ,, Level, ,, and, tes] |
|than |[Lighter, than, metal, ,, Level, ,, and, tww] |
|steel |[Lighter, than, metal, ,, Level, ,, and, test] |
|, |[Lighter, than, metal, ,, Level, ,, and, Test] |
|Level |[Lighter, than, metal, ,, Level, ,, and, test] |
|, |[Lighter, than, metal, ,, Level, ,, and, ste] |
|and |[Lighter, than, metal, ,, Level, ,, and, ste] |
|Test |[Lighter, than, metal, ,, Level, ,, and, Ste] |
|Renewable |[Renewable, resource] |
|Resource |[Renewable, resource] |
|No |[No1, Bal, testme, saves, time, and, money] |
+----------+-------------------------------------------------+


I want to filter the rows on the words column in a case-insensitive way.
Currently I am doing it like this:

df.filter(array_contains('words, "level")).show(false)


but it is not showing any data.
Please help me resolve the issue.

Answer

For this you can create a simple UDF that converts both the array elements and the search term to lower case and then checks for containment.

Here is a simple example:

scala> import spark.implicits._
import spark.implicits._

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq(("An", List("An", "attractive"," ","", "thin", "low", "profile")), ("Lighter", List("Lighter", "than", "metal"," " ,"", "Level"," " ,"", "and", "tes"))).toDF("col1", "words")
df: org.apache.spark.sql.DataFrame = [col1: string, words: array<string>]

scala> val filterUdf = udf((arr: Seq[String]) => arr.map(_.toLowerCase).contains("level".toLowerCase))
filterUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(ArrayType(StringType,true))))

scala> df.filter(filterUdf($"words")).show(false)

+-------+-------------------------------------------------+
|col1   |words                                            |
+-------+-------------------------------------------------+
|Lighter|[Lighter, than, metal,  , , Level,  , , and, tes]|
+-------+-------------------------------------------------+
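The UDF above hardcodes the term "level". Not part of the original answer, but a minimal sketch of how you could generalise it to arbitrary search terms (containsIgnoreCase is a hypothetical helper name):

scala> // builds a case-insensitive "array contains term" UDF for any search term
scala> def containsIgnoreCase(term: String) = udf((arr: Seq[String]) => arr.exists(_.equalsIgnoreCase(term)))

scala> df.filter(containsIgnoreCase("LEVEL")($"words")).show(false)

This should return the same Lighter row as above, whatever the casing of the search term.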

Hope this helps!
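As a side note (my own addition, not from the original answer): if you are on Spark 2.4 or later, the same case-insensitive filter can be expressed without a UDF, using the exists higher-order function in Spark SQL:

scala> // exists() returns true if the lambda holds for any element of the array
scala> df.filter(expr("exists(words, w -> lower(w) = 'level')")).show(false)

Keeping the predicate in built-in Spark SQL functions rather than an opaque UDF generally gives the optimizer more room to work with.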
