Anthony Anthony - 20 days ago 8
SQL Question

How to get records that exist in table 1 but not in table 2

I have two tables: excluded and kaggleresults. I am trying to find records that exist in

excluded
but don't exist in
kaggleresults


counts:

scala> spark.sql("select * from excluded").count()
res136: Long = 4652

scala> spark.sql("select * from kaggleresults").count()
res137: Long = 4635


The difference is
17


scala> res136-res137
res139: Long = 17


I am trying to get those 17 records. I wrote the query below but it returns
38
instead.

scala> spark.sql("select * from excluded left join kaggleresults on kaggleresults.subject_id = excluded.subject_id where kaggleresults.subject_id is null").count()
res135: Long = 38


Question

What query do I need to write to get those 17 records?

Answer

For the actual records (not the counts), you can use

SELECT * FROM excluded
WHERE subject_id NOT IN ( SELECT subject_id FROM kaggleresults )

However, you should not be surprised if the resultant row count does not match the difference of the two tables' individual row counts.

For example: suppose table1 has id's 1, 2, 3, 4, and 5, and table2 has id's 3, 4, 5, and 6. The row counts of table1 and of table2 are 5 and 4, respectively, for a difference of 1, but there are actually two records in table1 that have id's that do not exist in table2.