Newbie Newbie - 1 month ago 12
Scala Question

How to generate Path with DateTimeFormat for pattern yyyy/MM/dd/HH

I am using Spark/Scala to load files from S3. My files are located under:

s3://bucket/yyyy/MM/dd/HH/parts...files


I need to generate the file paths from a startDate and an endDate, both given as strings:

import org.joda.time.{DateTime, DateTimeZone}
import org.joda.time.Days
import org.joda.time.DurationFieldType
import org.joda.time.LocalDate
import org.joda.time.format.DateTimeFormat
import org.joda.time.format.DateTimeFormatter

val startDate = "2016-09-25T04:00:00Z"

val endDate = "2016-10-23T04:00:00Z"

val s3Bucket = "s3://test_bucket/"

def getUtilDate(timestamp: String): java.sql.Date = new java.sql.Date(new DateTime(timestamp, DateTimeZone.UTC).toDate().getTime())

val start = new LocalDate(getUtilDate(startDate))

val end = new LocalDate(getUtilDate(endDate))

val days: Int = Days.daysBetween(start, end).getDays

val files: Seq[String] = (0 to days)
  .map(start.plusDays)
  .map(d => s"$s3Bucket${DateTimeFormat.forPattern("yyyy/MM/dd/HH").print(d)}/*")

val testFiles = sc.textFile(files.mkString(","), 20000)

val df = sqlContext.read.json(testFiles)


I go through sc.textFile because sqlContext.read.json() doesn't take multiple paths.

But the generated paths don't contain the hour. They come out as
s3://test_bucket/2016/09/26/��/*


Can someone tell me why the HH shows as ��? Is there any way to get all the hours between the two timestamps, i.e. between
"2016-09-25T04:00:00Z" and "2016-10-23T04:00:00Z"

like

s3://test_bucket/2016/09/25/04/*
...
s3://test_bucket/2016/10/23/04/*

Answer

You have used LocalDate, which is a date-only class: it explicitly does not contain time information (unlike java.sql.Date, which wraps a full millisecond timestamp). Joda therefore cannot render "HH" as an hour, because the value has no hour field, and it prints the replacement character � in place of each unsupported pattern letter.

Try instead:

val startDate = "2016-09-25T04:00:00Z"

val endDate = "2016-10-23T04:00:00Z"

val s3Bucket = "s3://test_bucket/"

def getUtilDate(timestamp: String): org.joda.time.DateTime =
  new DateTime(timestamp, DateTimeZone.UTC)

val start = getUtilDate(startDate)

val end = getUtilDate(endDate)

val days: Int = Days.daysBetween(start, end).getDays

val files: Seq[String] = (0 to days)
  .map(start.plusDays)
  .map(d => s"$s3Bucket${DateTimeFormat.forPattern("yyyy/MM/dd/HH").print(d)}/*")

println(files)
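
The difference between the two types can also be seen in isolation. The following is a minimal sketch, assuming Joda-Time is on the classpath; the object name HourFieldDemo and its field names are made up for illustration and are not part of the question or the answer.

```scala
import org.joda.time.{DateTime, DateTimeZone, LocalDate}
import org.joda.time.format.DateTimeFormat

// Hypothetical demo object: formats the same moment as a DateTime and as a LocalDate.
object HourFieldDemo {
  val fmt = DateTimeFormat.forPattern("yyyy/MM/dd/HH")

  // Full instant in UTC: carries an hour-of-day field.
  val instant: DateTime = new DateTime("2016-09-25T04:00:00Z", DateTimeZone.UTC)

  // Date-only view of the same instant: the time of day is dropped.
  val dateOnly: LocalDate = instant.toLocalDate

  // The hour is available, so this is "2016/09/25/04".
  val withHour: String = fmt.print(instant)

  // The hour is not available, so the HH part cannot be rendered as "04".
  val withoutHour: String = fmt.print(dateOnly)

  def main(args: Array[String]): Unit = {
    println(withHour)
    println(withoutHour)
  }
}
```

Both strings share the 2016/09/25/ prefix; only the DateTime version ends in the hour.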

Update: to list all the hours between the days

To list each hour between the two DateTimes, you need to step from start to end, adding one hour with "plusHours" each time. In most languages you'd use a C-style "for" loop for that, but Scala doesn't have one. There are two main ways to do this in Scala; both are shown below:

val startDate = "2016-09-25T04:00:00Z"
val endDate = "2016-10-23T04:00:00Z"

val s3Bucket = "s3://test_bucket/"

def getUtilDate(timestamp: String): org.joda.time.DateTime =
  new DateTime(timestamp, DateTimeZone.UTC)

val start = getUtilDate(startDate)
val end = getUtilDate(endDate)

val fmt = DateTimeFormat.forPattern("yyyy/MM/dd/HH")
// Append the /* glob so each entry matches the files under that hour's prefix.
def bucketName(date: DateTime): String = s"$s3Bucket${fmt.print(date)}/*"

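import scala.annotation.tailrec
import scala.collection.mutable

{
  // Imperative style: step a mutable cursor one hour at a time.
  var t = start
  val files = mutable.Buffer[String]()
  // !isAfter keeps the end hour itself in the result.
  while (!t.isAfter(end)) {
    files += bucketName(t)
    t = t.plusHours(1)
  }

  println(files)
}

{
  // Functional style: tail-recursive accumulation.
  // Stopping on isAfter (rather than on equality with end) also terminates
  // when end is not a whole number of hours after start.
  @tailrec
  def loop(t: DateTime, acc: Seq[String]): Seq[String] =
    if (t.isAfter(end)) acc
    else loop(t.plusHours(1), acc :+ bucketName(t))

  val files = loop(start, Vector())

  println(files)
}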
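
A third option, not part of the original answer, is to build the sequence lazily with Iterator.iterate and cut it off with takeWhile. The sketch below assumes Joda-Time is on the classpath; the object name HourlyPaths and the method hourlyPaths are hypothetical names chosen for illustration. It includes the end hour itself, which matches the range asked for in the question.

```scala
import org.joda.time.{DateTime, DateTimeZone}
import org.joda.time.format.DateTimeFormat

// Hypothetical helper object, not from the original answer.
object HourlyPaths {
  private val fmt = DateTimeFormat.forPattern("yyyy/MM/dd/HH")

  // One path per hour from start to end, both ends included.
  def hourlyPaths(bucket: String, start: DateTime, end: DateTime): Vector[String] =
    Iterator.iterate(start)(_.plusHours(1)) // start, start + 1h, start + 2h, ...
      .takeWhile(t => !t.isAfter(end))      // stop once the cursor passes end
      .map(t => s"$bucket${fmt.print(t)}/*")
      .toVector

  def main(args: Array[String]): Unit = {
    val start = new DateTime("2016-09-25T04:00:00Z", DateTimeZone.UTC)
    val end = new DateTime("2016-10-23T04:00:00Z", DateTimeZone.UTC)
    val files = hourlyPaths("s3://test_bucket/", start, end)
    println(files.head)
    println(files.last)
    println(files.size)
  }
}
```

The resulting paths can be joined with files.mkString(",") and passed to sc.textFile as in the question.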