javadba javadba - 3 months ago 50
Python Question

How to access SparkContext in pyspark script

The following SOF question How to run script in Pyspark and drop into IPython shell when done? tells how to launch a pyspark script:

%run -d myscript.py


But how do we access the existin spark context?

Just creating a new one does not work:

----> sc = SparkContext("local", 1)

ValueError: Cannot run multiple SparkContexts at once; existing
SparkContext(app=PySparkShell, master=local) created by <module> at
/Library/Python/2.7/site-packages/IPython/utils/py3compat.py:204


But trying to use an existing one .. well what existing one?

In [50]: for s in filter(lambda x: 'SparkContext' in repr(x[1]) and len(repr(x[1])) < 150, locals().iteritems()):
print s
('SparkContext', <class 'pyspark.context.SparkContext'>)


i.e. there is no variable for a SparkContext instance

Answer

Standalone python script for wordcount : write a reusable spark context by using contextmanager

"""SimpleApp.py"""
from contextlib import contextmanager
from pyspark import SparkContext
from pyspark import SparkConf


SPARK_MASTER='local'
SPARK_APP_NAME='Word Count'
SPARK_EXECUTOR_MEMORY='200m'

@contextmanager
def spark_manager():
    conf = SparkConf().setMaster(SPARK_MASTER) \
                      .setAppName(SPARK_APP_NAME) \
                      .set("spark.executor.memory", SPARK_EXECUTOR_MEMORY)
    spark_context = SparkContext(conf=conf)

    try:
        yield spark_context
    finally:
        spark_context.stop()

with spark_manager() as context:
    File = "/home/ramisetty/sparkex/README.md"  # Should be some file on your system
    textFileRDD = context.textFile(File)
    wordCounts = textFileRDD.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
    wordCounts.saveAsTextFile("output")

print "WordCount - Done"

to launch:

/bin/spark-submit SimpleApp.py
Comments