Alex Pupyshev - 1 year ago - 190 views
Python Question

Spark-submit: undefined function parse_url

The parse_url function always works fine when we use spark-sql through a SQL client (via the Thrift server), IPython, or the pyspark shell, but it does not work in spark-submit mode:

/opt/spark/bin/spark-submit --driver-memory 4G --executor-memory 8G

The error is:

Traceback (most recent call last):
File "/home/spark/***/", line 167, in <module>
)v on = and reg_path = oldtrack_page and registration_day = day_cl_log and date_cl_log <= registration_date""")
File "/opt/spark/python/lib/", line 552, in sql
File "/opt/spark/python/lib/", line 538, in __call__
File "/opt/spark/python/lib/", line 40, in deco
pyspark.sql.utils.AnalysisException: undefined function parse_url;
Build step 'Execute shell' marked build as failure
Finished: FAILURE

So we are using this workaround:

def python_parse_url(url, que, key):
    import urlparse
    ians = None
    if que == "QUERY":
        ians = urlparse.parse_qs(urlparse.urlparse(url).query)[key][0]
    elif que == "HOST":
        ians = urlparse.urlparse(url).hostname
    elif que == "PATH":
        ians = urlparse.urlparse(url).path
    return ians

def dc_python_parse_url(url, que, key):
    ians = python_parse_url(url, que, key)
    return ians

sqlCtx.registerFunction('my_parse_url', dc_python_parse_url)
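As a side note, the workaround above relies on the Python 2 urlparse module; on Python 3 that module has moved to urllib.parse. A sketch of an equivalent helper (assuming Python 3, with the same QUERY/HOST/PATH parts used above):

```python
from urllib.parse import parse_qs, urlparse

def python_parse_url(url, que, key=None):
    # Mirrors the Python 2 helper above for the three parts it handles:
    # QUERY returns the first value for `key`, HOST the hostname, PATH the path.
    if que == "QUERY":
        return parse_qs(urlparse(url).query)[key][0]
    elif que == "HOST":
        return urlparse(url).hostname
    elif que == "PATH":
        return urlparse(url).path
    return None
```

For example, `python_parse_url('http://example.com/a/b?x=1', 'QUERY', 'x')` returns `'1'`.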

Could anyone help with this issue?

Answer Source

Spark >= 2.0

Same as below, but use a SparkSession with Hive support enabled:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

Spark < 2.0

parse_url is not a standard SQL function. It is a Hive UDF, and as such it requires a HiveContext to work:

from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext

sc = SparkContext()

sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)

query = """SELECT parse_url('', 'HOST')"""

sqlContext.sql(query)
## Py4JJavaError                             Traceback (most recent call last)
##   ...
## AnalysisException: 'undefined function parse_url;'

hiveContext.sql(query)
## DataFrame[_c0: string]