Anthony Anthony - 28 days ago 5
Scala Question

How to use NOT IN from a CSV file in Spark

I use Spark sql to load data into a

val
like this

val customers = sqlContext.sql("SELECT * FROM customers")


But I have a separate txt file that contains one column
CUST_ID
and 50,00 rows. i.e.

CUST_ID
1
2
3


I want my
customers
val to have all customers in
customers
table that are not in the TXT file.

Using Sql I would do this by
SELECT * FROM customers NOT IN cust_id ('1','2','3')


How can I do this using Spark?

I've read the textFile and I can print rows of it but I'm not sure how to match this with my sql query

scala> val custids = sc.textFile("cust_ids.txt")
scala> custids.take(4).foreach(println)
CUST_ID
1
2
3

Answer

You can import your text file as a dataframe and do a left outer join:

val customers = Seq(("1", "AAA", "shipped"), ("2", "ADA", "delivered") , ("3", "FGA", "never received")).toDF("id","name","status")
val custId = Seq(1,2).toDF("custId")

customers.join(custId,'id === 'custId,"leftOuter")
         .where('custId.isNull)
         .drop("custId")
         .show()


+---+----+--------------+
| id|name|        status|
+---+----+--------------+
|  3| FGA|never received|
+---+----+--------------+