sathiyarajan sathiyarajan - 3 years ago 118
Java Question

How to write word count using Dataset API?

I need to write word-count logic using the Spark Dataset API alone.

I implemented the same using the JavaRDD class of Spark, but I want to do the same process using the Dataset<Row> class of Spark SQL.

How to do word count in Spark SQL?

Answer Source

Here's one solution (and quite likely not the most efficient one).

// using the col function since the OP uses Java, not Scala...unfortunately
// note: length, split and explode must be imported too, not just col
import org.apache.spark.sql.functions.{col, explode, length, split}
val q = spark.
  read.
  text("README.md").
  filter(length(col("value")) > 0).
  withColumn("words", split(col("value"), "\\s+")).
  select(explode(col("words")) as "word").
  groupBy("word").
  count.
  orderBy(col("count").desc)
scala> q.show
+---------+-----+
|     word|count|
+---------+-----+
|      the|   24|
|       to|   17|
|    Spark|   16|
|      for|   12|
|      and|    9|
|       ##|    9|
|         |    8|
|        a|    8|
|       on|    7|
|      can|    7|
|      run|    7|
|       in|    6|
|       is|    6|
|       of|    5|
|    using|    5|
|      you|    4|
|       an|    4|
|    build|    4|
|including|    4|
|     with|    4|
+---------+-----+
only showing top 20 rows
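Since the question asks for Java, here is a rough equivalent using Dataset<Row> and the static functions from org.apache.spark.sql.functions. This is a sketch, assuming Spark 2.x with a local master; adjust the app name and input path to your setup.

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WordCount {
  public static void main(String[] args) {
    // hypothetical local session for illustration
    SparkSession spark = SparkSession.builder()
        .appName("WordCount")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> counts = spark.read()
        .text("README.md")                              // one row per line, column "value"
        .filter(length(col("value")).gt(0))             // drop empty lines
        .withColumn("words", split(col("value"), "\\s+")) // tokenize on whitespace
        .select(explode(col("words")).as("word"))       // one row per word
        .groupBy("word")
        .count()
        .orderBy(col("count").desc());

    counts.show();
    spark.stop();
  }
}
```

The chain mirrors the Scala version one-to-one; the only Java-specific differences are the static import of the functions and Column methods like gt() and desc() in place of Scala operators.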