HackerDuck - 3 months ago
Scala Question

Filtering RDD by substring values

I want to filter out some entries from an

RDD[(String, List[(String, String, String, String)])]
based on the values inside the tuples of each list.

This is my sample data:

(600,List((600,111,1,1), (615,111,1,5)))
(600,List((638,111,2,null), (649,222,3,1)))
(600,List((638,111,2,3), (649,null,3,1)))


In particular, I want to check the 4th field of each tuple in the list (counting from 1). If it is equal to null in any tuple, the whole entry should be deleted. The result should be the following:

(600,List((600,111,1,1), (615,111,1,5)))
(600,List((638,111,2,3), (649,null,3,1)))


So, in this particular example the second entry should be deleted.

This is my attempt to solve this task:

val filtered = separated.map(l => (l._1,l._2.filter(!_._4.equals("null"))))


The problem is that it deletes only the offending tuple from the list, not the whole entry. The result is the following (instead of the one shown above):

(600,List((600,111,1,1), (615,111,1,5)))
(600,List((649,222,3,1)))
(600,List((638,111,2,3), (649,null,3,1)))

Answer

Filter the RDD itself rather than mapping over it: keep an entry only if its list contains no tuple whose 4th field is "null".

yourRdd.filter({
  case (id, list) => !list.exists(t => t._4.equals("null"))
})
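To try the predicate without a Spark cluster, here is a minimal sketch that runs it on a plain Scala List standing in for the RDD. It assumes that "null" in the data is the literal string (as the question's equals("null") suggests). The names FilterEntriesDemo, Entry and hasNoNullFourth are made up for illustration. Since RDD.filter takes the same kind of T => Boolean predicate as List.filter, the same function should be usable as yourRdd.filter(hasNoNullFourth).

```scala
object FilterEntriesDemo {
  // Same element type as the RDD in the question.
  type Entry = (String, List[(String, String, String, String)])

  // Sample data from the question; "null" is assumed to be a literal string.
  val separated: List[Entry] = List(
    ("600", List(("600", "111", "1", "1"), ("615", "111", "1", "5"))),
    ("600", List(("638", "111", "2", "null"), ("649", "222", "3", "1"))),
    ("600", List(("638", "111", "2", "3"), ("649", "null", "3", "1")))
  )

  // Keep an entry only if none of its tuples has "null" in the 4th field.
  // `==` is null-safe in Scala, so a real null reference would not throw here.
  def hasNoNullFourth(entry: Entry): Boolean =
    !entry._2.exists(t => t._4 == "null")

  // Filtering whole entries (not the inner lists) drops the second entry.
  val filtered: List[Entry] = separated.filter(hasNoNullFourth)

  def main(args: Array[String]): Unit =
    filtered.foreach(println)
}
```

The third entry is kept because its "null" sits in the 2nd field, not the 4th. Using == instead of equals is a small deviation from the answer above: it behaves the same for the string "null" but also tolerates fields that are actual null references.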