krunarsson - 1 month ago
Scala Question

Get distinct items from rows of comma separated strings in Spark 2.0

I am using Spark 2.0 to analyze a data set. One column contains string data like this:

A,C
A,B
A
B
B,C


I want to get a JavaRDD containing all distinct items that appear in the column, something like this:

A
B
C


How can this be done efficiently in Spark? I am using Spark with Java, but Scala examples or pointers would also be useful.

Edit:
I have tried using flatMap, but my implementation is very slow.

JavaRDD<String> d = dataset.flatMap(s -> Arrays.asList(s.split(",")).iterator());

Answer

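Try using:

1) explode: https://spark.apache.org/docs/2.0.0/api/java/ (org.apache.spark.sql.functions.explode)

static Column explode(Column e)

Explode creates a new row for each element in the given array or map column. Because the column here holds comma-separated strings, convert it to an array first with org.apache.spark.sql.functions.split, then explode the result.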

2) Then execute "distinct" on this column:

http://spark.apache.org/docs/latest/programming-guide.html

distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
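The two steps can be sketched in Java roughly as follows. This is a minimal sketch, not a tuned solution. It assumes a local SparkSession, a column named "items" (a name chosen here for illustration), and the sample rows from the question. Since explode expects an array or map column, the string is first passed through split.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.split;

public class DistinctItemsSpark {
    public static void main(String[] args) {
        // Local session for illustration only; master and app name are assumptions.
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("distinct-items")
                .getOrCreate();

        // Sample rows from the question, loaded into a single column named "items".
        List<String> rows = Arrays.asList("A,C", "A,B", "A", "B", "B,C");
        Dataset<Row> df = spark.createDataset(rows, Encoders.STRING()).toDF("items");

        // split turns each string into an array; explode emits one row per element;
        // distinct then removes the duplicates.
        Dataset<Row> distinctItems = df
                .select(explode(split(col("items"), ",")).as("item"))
                .distinct();

        // Convert back to a JavaRDD<String>, as the question asks for.
        JavaRDD<String> result = distinctItems.as(Encoders.STRING()).toJavaRDD();

        // The order of the collected items is not guaranteed.
        result.collect().forEach(System.out::println);

        spark.stop();
    }
}
```

Whether this is faster than the flatMap attempt in the question depends on the data and cluster, so it is worth measuring both.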

Summary

Explode produces one item per row (in this specific column).

Distinct then keeps only the unique items.
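To check what these two steps should return for the sample column, the same logic can be modeled on a plain Java list with no Spark involved. This is only a local illustration. The class and method names are hypothetical, and the final sort is added here purely to make the printed output deterministic (Spark's distinct makes no ordering guarantee).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class DistinctItemsLocal {
    // Hypothetical helper: one element per comma-separated item, then deduplicate.
    public static List<String> distinctItems(List<String> rows) {
        return rows.stream()
                .flatMap(row -> Arrays.stream(row.split(","))) // the "explode" step
                .distinct()                                    // the "distinct" step
                .sorted()                                      // only for stable output
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("A,C", "A,B", "A", "B", "B,C");
        System.out.println(distinctItems(rows)); // prints [A, B, C]
    }
}
```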
