shakedzy shakedzy - 14 days ago 5
Scala Question

Spark: write to master's log from workers

I got this general piece of code that I'm running:

df.rdd.foreachPartition( i => {
//some code
//writing to log
})


The problem is that the writing to log is performed on the workers themselves, not on the master, so the log entries are scattered across machines and are very hard, sometimes even impossible, to retrieve. Is there a way to write to the master's logs from the workers, or some other workaround?

Answer

There's no direct way to write to the master's log: distributed processing means your code runs on several machines, so any access to a machine's resources (e.g. the file system) is distributed too.

There are a few ways you can achieve what you want:

  1. Treat logs as data: instead of using foreachPartition, you can use mapPartitions with a function that returns an Iterator[String] containing the log lines you want to write (or the data needed to construct them). Assuming the total number of log lines isn't huge, you can then collect them on the driver machine and log them there:

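     val logLines = df.rdd.mapPartitions( i => {
       //some code
       //replace ??? with the code that constructs the log lines
       val log: Iterator[String] = ???
       log
     }).collect()
    
     //runs on the driver, so the lines end up in the driver's log
     logLines.foreach(line => logger.info(line))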
    
  2. Use a log-aggregation framework: these frameworks collect logs from multiple machines and can display them as a single stream of log entries. This is very useful for distributed computing, as it makes accessing a specific machine's log unnecessary.
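
For option 1, here is a slightly fuller sketch of what "logs as data" might look like end to end. It is only an illustration under some assumptions: the object name LogsAsData, the helper processPartition, the single integer column "value", the local[2] master and the use of SLF4J as the logger are all made up for the example and are not part of the original question. It uses mapPartitionsWithIndex rather than mapPartitions so that each line can carry the partition it came from.

```scala
import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

object LogsAsData {

  // Hypothetical per-partition work. Instead of writing to a local log on
  // the worker, it returns the log lines as ordinary data.
  def processPartition(index: Int, values: Iterator[Int]): Iterator[String] = {
    val items = values.toList // materialise the partition (assumed to be small)
    Iterator(
      s"partition=$index started with ${items.size} rows",
      s"partition=$index finished, sum=${items.sum}"
    )
  }

  def main(args: Array[String]): Unit = {
    val logger = LoggerFactory.getLogger(getClass)

    // Local session for illustration only; a real job would get its master
    // from spark-submit.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("logs-as-data")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the df in the question: a single Int column named "value".
    val df = (1 to 10).toDF("value")

    // Runs on the workers; only the returned strings travel back to the driver.
    val logLines: Array[String] = df.rdd
      .mapPartitionsWithIndex((index, rows) => processPartition(index, rows.map(_.getInt(0))))
      .collect()

    // Runs on the driver, so every line lands in the driver's log.
    logLines.foreach(line => logger.info(line))

    spark.stop()
  }
}
```

Keeping the per-partition logic in a plain function also means it can be checked without a cluster. As in the original answer, this only makes sense while the number of collected lines is small enough to fit comfortably in the driver's memory.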
