What is the correct way to access the log4j logger of Spark using pyspark on an executor?
It's easy to do so in the driver but I cannot seem to understand how to access the logging functionalities on the executor so that I can log locally and let YARN collect the local logs.
Is there any way to access the local logger?
The standard logging procedure is not enough because I cannot access the spark context from the executor.
You cannot use the local log4j logger on executors. Python workers spawned by the executor JVMs have no "callback" connection to Java; they just receive commands. But there is a way to log from executors using standard Python logging and have YARN capture the logs.
On your HDFS, place a Python module file that configures logging once per Python worker and proxies the logging functions (name it
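    import os
    import logging
    import sys

    class YarnLogger:
        @staticmethod
        def setup_logger():
            # Without LOG_DIRS there is no container log directory to write to.
            if 'LOG_DIRS' not in os.environ:
                sys.stderr.write('Missing LOG_DIRS environment variable, pyspark logging disabled\n')
                return

            # LOG_DIRS may list several directories separated by commas; use the first.
            log_file = os.environ['LOG_DIRS'].split(',')[0] + '/pyspark.log'
            logging.basicConfig(filename=log_file, level=logging.INFO,
                                format='%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s')

        def __getattr__(self, key):
            # Proxy everything else (info, warning, ...) to the logging module.
            return getattr(logging, key)

    # Runs once, when the module is first imported in a Python worker.
    YarnLogger.setup_logger()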
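The `LOG_DIRS` handling in the module above can be checked without a cluster. The following is a minimal sketch, assuming `LOG_DIRS` holds a comma-separated list of container log directories; the helper name `resolve_log_file` and the example paths are hypothetical and not part of the module.

```python
import os


def resolve_log_file(environ, filename="pyspark.log"):
    """Return the log file path inside the first LOG_DIRS entry, or None if unset."""
    log_dirs = environ.get("LOG_DIRS")
    if not log_dirs:
        return None
    # Only the first directory is used, matching the module above.
    first_dir = log_dirs.split(",")[0]
    return os.path.join(first_dir, filename)


# Made-up values for illustration only.
example_env = {"LOG_DIRS": "/tmp/container_01/logs,/mnt/container_01/logs"}
print(resolve_log_file(example_env))
print(resolve_log_file({}))
```

Passing the environment as a plain dict keeps the helper easy to try locally; on an executor you would pass `os.environ` instead.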
Then import this module inside your application:
    spark.sparkContext.addPyFile('hdfs:///user/_hc_gtouts/logger.py')
    import logger
    logger = logger.YarnLogger()
And you can use it inside your pyspark functions like the normal logging library:
    def map_sth(s):
        logger.info("Mapping " + str(s))
        return s

    spark.range(10).rdd.map(map_sth).count()
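The proxy behaviour of `__getattr__` can also be tried locally, without Spark or YARN. This is only a sketch under stated assumptions: `LoggingProxy` is a hypothetical stand-in for `YarnLogger`, an in-memory stream replaces the file under `LOG_DIRS`, and a plain list comprehension replaces the RDD `map`.

```python
import io
import logging


class LoggingProxy:
    """Hypothetical stand-in: forwards info, warning, etc. to the logging module."""

    def __getattr__(self, key):
        return getattr(logging, key)


stream = io.StringIO()
# force=True (Python 3.8+) replaces any handlers that were configured earlier.
logging.basicConfig(stream=stream, level=logging.INFO,
                    format="%(levelname)s %(funcName)s: %(message)s", force=True)

logger = LoggingProxy()


def map_sth(s):
    logger.info("Mapping " + str(s))
    return s


# Plain Python loop standing in for spark.range(3).rdd.map(map_sth).
results = [map_sth(i) for i in range(3)]
output = stream.getvalue()
print(output)
```

Each call produces one line tagged with the calling function's name, which is what the `%(funcName)s` field in the format string is for.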