HackerDuck HackerDuck - 5 months ago 44
Scala Question

Filtering RDD by substring values

I want to filter out some entries from an RDD based on the values in its nested tuples ("substrings"):

This is my sample data:

(600,List((600,111,1,1), (615,111,1,5)))
(600,List((638,111,2,null), (649,222,3,1)))
(600,List((638,111,2,3), (649,null,3,1)))

In particular, I want to check the 4th field of each substring (counting from 1). If it is equal to null, then the whole entry should be deleted. The result should be the following:

(600,List((600,111,1,1), (615,111,1,5)))
(600,List((638,111,2,3), (649,null,3,1)))

So, in this particular example the second entry should be deleted.

This is my attempt to solve this task:

val filtered = separated.map(l => (l._1,l._2.filter(!_._4.equals("null"))))

The problem is that it just deletes the substring, but not the whole entry. The result is the following (instead of the above-mentioned one):

(600,List((600,111,1,1), (615,111,1,5)))
(600,List((649,222,3,1)))
(600,List((638,111,2,3), (649,null,3,1)))


Filter your RDD by checking that the list of tuples does not contain a tuple whose 4th entry is "null":

  val filtered = separated.filter { case (id, list) => !list.exists(t => t._4.equals("null")) }
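
To see the predicate at work without a Spark cluster, here is a minimal sketch that runs on a plain Scala Seq holding the sample data from the question. It rests on assumptions: every field is taken to be a String, with the text "null" marking a missing value, and the names FilterEntries, Row and keep are made up for illustration. The same predicate should work when passed to filter on the RDD.

```scala
// Sketch on a plain Scala Seq, not an RDD.
// Assumption: all fields are Strings and "null" is a literal string marker.
object FilterEntries {
  type Row = (String, String, String, String)

  // Sample data from the question.
  val separated: Seq[(String, List[Row])] = Seq(
    ("600", List(("600", "111", "1", "1"), ("615", "111", "1", "5"))),
    ("600", List(("638", "111", "2", "null"), ("649", "222", "3", "1"))),
    ("600", List(("638", "111", "2", "3"), ("649", "null", "3", "1")))
  )

  // Keep an entry only if none of its tuples has "null" as the 4th field.
  // `==` is used instead of `.equals` so that a real null reference
  // compares as false rather than throwing a NullPointerException.
  def keep(entry: (String, List[Row])): Boolean =
    !entry._2.exists(_._4 == "null")

  // filter drops whole entries; map + inner filter would only drop tuples.
  val filtered: Seq[(String, List[Row])] = separated.filter(keep)

  def main(args: Array[String]): Unit =
    filtered.foreach(println)
}
```

Running main prints the first and third entries only. The third entry survives because its "null" sits in the 2nd field, and only the 4th field is checked.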