titipat - 2 months ago
Python Question

Running PySpark using Cronjob (crontab)

First, I assume that we have SPARK_HOME set up; in my case it's at ~/Desktop/spark-2.0.0. Basically, I want to run my PySpark script via a cron job (e.g. crontab -e). My question is how to set up the environment path so that the Spark script works with the cron job. Here is my sample script, example.py:


import os
from pyspark import SparkConf, SparkContext

# Configure the environment; expand '~' explicitly, since values in
# os.environ are not tilde-expanded automatically
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = os.path.expanduser('~/Desktop/spark-2.0.0')

conf = SparkConf().setAppName('example').setMaster('local[8]')
sc = SparkContext(conf=conf)

if __name__ == '__main__':
    ls = range(100)
    ls_rdd = sc.parallelize(ls, numSlices=10)
    ls_out = ls_rdd.map(lambda x: x + 1).collect()

    # 'with' closes the file even if an exception occurs
    with open('test.txt', 'w') as f:
        for item in ls_out:
            f.write("%s\n" % item)  # save list to test.txt
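One caveat worth knowing here: values assigned to os.environ are used verbatim, so a literal ~ is not expanded into the home directory on its own. A minimal sketch of the difference, using the path from the question:

```python
import os

# '~' written into an environment variable stays a literal tilde;
# os.path.expanduser must be applied explicitly before storing it.
raw = '~/Desktop/spark-2.0.0'
expanded = os.path.expanduser(raw)

print(raw)       # the tilde survives as-is
print(expanded)  # home directory spelled out in full
```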


My bash script, run_example.sh, is as follows:

rm -f test.txt  # -f: no error if the file does not exist yet

~/Desktop/spark-2.0.0/bin/spark-submit \
--master local[8] \
--driver-memory 4g \
--executor-memory 4g \
example.py


Here, I want to run run_example.sh every minute using crontab. However, I don't know how to set up the path when I run crontab -e. So far, I have only found this Gitbook link. I have something like this in my crontab editor, but it doesn't run my code yet:

#!/bin/bash

# add path to cron (this line is the one I don't know)
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:$HOME/anaconda/bin

# run script every minute
* * * * * source run_example.sh


Thanks in advance!

Answer

What you can do is add the following line to your .bashrc file in your home directory:

export PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:$HOME/anaconda/bin

Then you can have the following entry in your crontab:

* * * * * source ~/.bashrc; sh run_example.sh

This entry sources your .bashrc first, which sets the PATH value, and then executes run_example.sh.
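If the job still does not run, a common way to debug is a temporary crontab entry that captures the environment cron actually uses, so you can compare its PATH with your interactive shell's (the /tmp file name here is just an example):

```shell
# temporary diagnostic entry: dump cron's environment once a minute,
# then inspect /tmp/cron_env.txt and remove this line again
* * * * * env > /tmp/cron_env.txt 2>&1
```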

Alternatively, you can set the PATH directly in run_example.sh:

export PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:$HOME/anaconda/bin
rm -f test.txt

~/Desktop/spark-2.0.0/bin/spark-submit \
  --master local[8] \
  --driver-memory 4g \
  --executor-memory 4g \
  example.py
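A third option, close to what the question already attempted: most cron implementations (Vixie cron and its derivatives) accept variable assignments at the top of the crontab itself. Note that cron does not expand $HOME inside such assignments, so the anaconda path must be written out literally; the /home/your_user paths below are placeholders, not paths from the question:

```shell
# crontab fragment: PATH is set once for every entry below;
# cron does not expand $HOME here, so write the path literally
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/home/your_user/anaconda/bin

# change into the directory holding run_example.sh (placeholder path),
# then run it every minute
* * * * * cd /home/your_user/scripts && sh run_example.sh
```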