javadba - 1 year ago
Python Question

How to access SparkContext in pyspark script

The following SO question, How to run script in Pyspark and drop into IPython shell when done?, explains how to launch a pyspark script:

%run -d

But how do we access the existing spark context?

Just creating a new one does not work:

----> sc = SparkContext("local", 1)

ValueError: Cannot run multiple SparkContexts at once; existing
SparkContext(app=PySparkShell, master=local) created by <module> at

But trying to use an existing one .. well what existing one?

In [50]: for s in filter(lambda x: 'SparkContext' in repr(x[1]) and len(repr(x[1])) < 150, locals().iteritems()):
   ....:     print s
('SparkContext', <class 'pyspark.context.SparkContext'>)

i.e. the only match is the SparkContext class itself; no variable is bound to a SparkContext instance

Answer Source

Standalone Python script for wordcount: build a reusable spark context with a contextmanager

from contextlib import contextmanager
from pyspark import SparkContext
from pyspark import SparkConf


# SPARK_MASTER, SPARK_APP_NAME and SPARK_EXECUTOR_MEMORY must be
# defined elsewhere in your configuration.
@contextmanager
def spark_manager():
    conf = SparkConf().setMaster(SPARK_MASTER) \
                      .setAppName(SPARK_APP_NAME) \
                      .set("spark.executor.memory", SPARK_EXECUTOR_MEMORY)
    spark_context = SparkContext(conf=conf)
    try:
        yield spark_context
    finally:
        spark_context.stop()  # always release the context, even on error

with spark_manager() as context:
    File = "/home/ramisetty/sparkex/"  # Should be some file on your system
    textFileRDD = context.textFile(File)
    wordCounts = textFileRDD.flatMap(lambda line: line.split()) \
                            .map(lambda word: (word, 1)) \
                            .reduceByKey(lambda a, b: a + b)

print "WordCount - Done"
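Since the answer hinges on contextlib.contextmanager, a minimal pure-Python sketch (no Spark required) shows the setup/teardown flow the decorator provides; the names here are illustrative:

```python
from contextlib import contextmanager

events = []

@contextmanager
def managed_resource():
    events.append("setup")         # runs when the with-block is entered
    try:
        yield "resource"           # value bound by `as`
    finally:
        events.append("teardown")  # runs even if the block raises

with managed_resource() as r:
    events.append("using " + r)

print(events)  # → ['setup', 'using resource', 'teardown']
```

The try/finally around the yield is what guarantees cleanup; the same shape is why spark_manager above can always call stop() on the context.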

to launch:
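The post cuts off here. Assuming the script above is saved as, say, wordcount.py (an illustrative name, not from the original), a standalone PySpark script is normally launched with spark-submit:

```shell
# Hypothetical filename and master URL; substitute your own.
spark-submit --master local wordcount.py
```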