Tangle Tangle - 4 months ago 78
Scala Question

How to remove the first few lines/header from multiple files using Scala in Spark

I was able to remove the first few lines of a single file using the code below:

scala> val file = sc.textFile("file:///root/path/file.csv")


Removing first 5 lines:

scala> val Data = file.mapPartitionsWithIndex{ (idx, iter) => if (idx == 0) iter.drop(5) else iter }


The problem: suppose I have multiple files with the same columns, and I want to load all of them into a single RDD, removing the first few lines of each file.

Is this actually possible?

I'd appreciate any help. Thanks in advance!

Answer

Let's assume there are two files.

ravis-MacBook-Pro:files raviramadoss$ cat file.csv
first_file_first_record
first_file_second_record
first_file_third_record
first_file_fourth_record
first_file_fifth_record
first_file_sixth_record
ravis-MacBook-Pro:files raviramadoss$ cat file_2.csv
second_file_first_record
second_file_second_record
second_file_third_record
second_file_fourth_record
second_file_fifth_record
second_file_sixth_record
second_file_seventh_record
second_file_eight_record

Scala code:

sc.wholeTextFiles("/Users/raviramadoss/files").flatMap( _._2.split("\n").drop(5) ).collect()

Output:

res41: Array[String] = Array(first_file_sixth_record, second_file_sixth_record, second_file_seventh_record, second_file_eight_record)
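
This works because wholeTextFiles returns one (path, content) pair per file, so drop(5) is applied to each file separately. The per-file step can be pulled out into a small helper, sketched below under a couple of assumptions: the names HeaderDrop and dropHeader are made up for illustration, and the split pattern is widened to "\\r?\\n" so that files with Windows line endings are also handled. The helper is plain Scala and does not need Spark to run.

```scala
object HeaderDrop {
  // Hypothetical helper (not part of Spark): split one file's content
  // into lines and skip the first n of them.
  // "\\r?\\n" matches both Unix (\n) and Windows (\r\n) line endings.
  def dropHeader(content: String, n: Int): Array[String] =
    content.split("\\r?\\n").drop(n)
}
```

In spark-shell it could then be used roughly as sc.wholeTextFiles("/Users/raviramadoss/files").flatMap { case (_, content) => HeaderDrop.dropHeader(content, 5) }. Keep in mind that wholeTextFiles reads each file into memory as a single record, so this approach is best suited to many small files rather than a few very large ones.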