João_testeSW - 3 months ago
Scala Question

Union all files from directory and sort based on first column

After implementing the code below:

def ngram(s: String, inSep: String, outSep: String, n:Int): Set[String] = {
s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}
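
For reference, a quick (hypothetical) example of what this definition produces:

// ngram("the quick brown fox", " ", "+", 2)
//   => Set("quick+the", "brown+quick", "brown+fox")
// Each window of n tokens is sorted before joining, so word order
// inside a pair does not matter.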

val fPath = "/user/root/data/data220K.txt"
val resultPath = "data/result220K"

val lines = sc.textFile(fPath) // lines: RDD[String]

val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo)).map(word => (word, 1)).reduceByKey((a, b) => a+b)
val sortedResult = result.map(pair => pair.swap).sortByKey(true)
println(sortedResult.count + "============================")
sortedResult.take(10)
sortedResult.saveAsTextFile(resultPath)
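
Since sortedResult is an RDD[(Int, String)], saveAsTextFile writes each pair's toString, so the part files contain lines like the following (the values here are made up):

// (3,brown+fox)
// (5,quick+the)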


I'm getting a large number of files in HDFS with this schema:
(Freq_Occurrencies, FieldA, FieldB)

Is it possible to merge all the files from that directory? Every row is different, but I want a single file sorted by Freq_Occurrencies. Is that possible?

Many thanks!

Answer
sortedResult
  .coalesce(1, shuffle = true)
  .saveAsTextFile(resultPath)

coalesce makes Spark use a single task for saving, thus creating only one part file. Passing shuffle = true keeps the upstream stages parallel; only the final write runs in one task. The downside is, of course, performance: all the data has to be shuffled to a single executor and written by a single thread.
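
If shuffling everything through one task is too slow for your data size, an alternative is to let Spark write the part files in parallel and merge them on HDFS afterwards. A minimal sketch, assuming Hadoop 2.x (FileUtil.copyMerge was removed in Hadoop 3) and the resultPath from the question; the ".merged" output name is just an illustration:

import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = sc.hadoopConfiguration
val fs   = FileSystem.get(conf)

// Write the sorted RDD with its normal parallelism first:
sortedResult.saveAsTextFile(resultPath)

// Then concatenate the part files into one file. copyMerge appends the
// parts in name order; sortByKey range-partitions the data, so
// part-00000 holds the smallest keys and the merged file stays sorted.
FileUtil.copyMerge(fs, new Path(resultPath),
                   fs, new Path(resultPath + ".merged"),
                   false /* keep the source dir */, conf, null)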