Newbie Newbie - 1 month ago 17
Scala Question

Spark: How to generate S3 file paths dynamically using a date diff

I am trying to get the list of files between startDate and endDate and read the files from those folders.

For example, my file structure looks like this: BucketName/year/month/day/files

s3://testBucket/2016/10/16/part00000


These files are all JSON. The issue is that I need to load all paths between startDate and endDate.

With start date 09/16/2016 and end date 10/16/2016, I would like to read from 09/16/2016 (inclusive) to 10/16/2016 (inclusive).

import org.joda.time.Days
import org.joda.time.DurationFieldType
import org.joda.time.LocalDate
import org.joda.time.format.DateTimeFormat
import org.joda.time.format.DateTimeFormatter
import scala.collection.mutable.ListBuffer

val s3Bucket: String = "s3://myTestBucket/"

val startTimestamp: String = "2016-09-16T00:00:00Z"
val endTimestamp: String = "2016-10-16T00:00:00Z"

// The pattern must match the whole timestamp string; 'Z' is matched literally
val dtf: DateTimeFormatter = DateTimeFormat.forPattern( "yyyy-MM-dd'T'HH:mm:ss'Z'" )
val startDate: LocalDate = dtf.parseLocalDate( startTimestamp )
val endDate: LocalDate = dtf.parseLocalDate( endTimestamp )

val days: Int = Days.daysBetween( startDate, endDate ).getDays
println( days )

val dates = new ListBuffer[String]()
var i: Int = 0
// <= so that the end date is included
while (i <= days) {
  val d: LocalDate = startDate.withFieldAdded( DurationFieldType.days, i )
  // getMonthOfYear and getDayOfMonth are not zero-padded (9, not 09)
  val tempDate: String = s3Bucket + d.getYear + "/" + d.getMonthOfYear + "/" + d.getDayOfMonth + "/" + "*"
  dates += tempDate
  i += 1
}
val dateList = dates.toList
// Pass the paths as separate arguments instead of one comma-joined string
sqlContext.read.json(dateList: _*)


Is this the right way to do this? Is there a more efficient way?

Answer

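I don't think it can be made much more efficient, but it's definitely not idiomatic (it uses while and var) and can be made more concise. Note that 0 to days is inclusive, so both the start and end dates are covered, and that the paths are passed to json as separate arguments rather than as one comma-joined string: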

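import org.joda.time.{Days, LocalDate}
import org.joda.time.format.DateTimeFormat

val s3Bucket: String = "s3://myTestBucket/"
val startDate: LocalDate = new LocalDate(2016, 9, 16)
val endDate: LocalDate = new LocalDate(2016, 10, 16)

// Whole days from startDate to endDate (30 here)
val days: Int = Days.daysBetween(startDate, endDate).getDays

// "MM" and "dd" zero-pad, producing paths such as 2016/09/16
val pathDTF = DateTimeFormat.forPattern("yyyy/MM/dd")

// "0 to days" is inclusive, so both endpoints get a path
val files: Seq[String] = (0 to days)
  .map(startDate.plusDays)
  .map(d => s"$s3Bucket${pathDTF.print(d)}/*")

// Pass each path as a separate argument
val result = sqlContext.read.json(files: _*)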
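
If Joda-Time is not on the classpath, the same path generation can probably be done with java.time (Java 8 or later). The following is only a sketch under that assumption: dailyPaths is a hypothetical helper name, not part of the answer above, and it assumes the folder names use zero-padded months and days. It needs no Spark, so it can be checked in a plain Scala REPL or script before the paths are handed to the reader.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Hypothetical helper (not from the original answer): builds one glob per day,
// with both the start and the end date included.
def dailyPaths(bucket: String, start: LocalDate, end: LocalDate): Seq[String] = {
  require(!end.isBefore(start), "end must not be before start")
  // Assumption: folders are zero-padded, e.g. 2016/09/16
  val pathFormat = DateTimeFormatter.ofPattern("yyyy/MM/dd")
  // Number of whole days from start to end (the end date itself is not counted)
  val days = ChronoUnit.DAYS.between(start, end)
  // An inclusive range adds the end date back in
  (0L to days).map(offset => s"$bucket${pathFormat.format(start.plusDays(offset))}/*")
}

val paths = dailyPaths(
  "s3://myTestBucket/",
  LocalDate.of(2016, 9, 16),
  LocalDate.of(2016, 10, 16)
)
```

The resulting sequence could then be passed on in the same way as in the answer, for example sqlContext.read.json(paths: _*).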