I have two tables: excluded and kaggleresults. I am trying to find records that exist in
scala> spark.sql("select * from excluded").count()
res136: Long = 4652
scala> spark.sql("select * from kaggleresults").count()
res137: Long = 4635
res139: Long = 17
scala> spark.sql("select * from excluded left join kaggleresults on kaggleresults.subject_id = excluded.subject_id where kaggleresults.subject_id is null").count()
res135: Long = 38
For the actual records (not the counts), you can use
SELECT * FROM excluded WHERE subject_id NOT IN ( SELECT subject_id FROM kaggleresults )
However, you should not be surprised if the resultant row count does not match the difference of the two tables' individual row counts.
For example: suppose table1 has id's 1, 2, 3, 4, and 5, and table2 has id's 3, 4, 5, and 6. The row counts of table1 and of table2 are 5 and 4, respectively, for a difference of 1, but there are actually two records in table1 that have id's that do not exist in table2.