
How to decode HTML entities in Spark?

I'm reading a large collection of text files into a DataFrame. Initially it will just have one column, value. The text files use HTML encoding (i.e., they have &lt; instead of <, etc.). I want to decode all of them back to normal characters.

Obviously, I could do it with a UDF, but it would be super slow.
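For reference, a minimal sketch of that UDF route (assuming the DataFrame is named df and has a value column) would be:

import html
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Every row is shipped to a Python worker and back, which is where the slowness comes from.
unescape = udf(html.unescape, StringType())
decoded = df.withColumn("value", unescape("value"))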

I could try regexp_replace, but it would be even slower, since there are over 200 named entities, and each would require its own regexp_replace call. Each call would need to scan the entire line of text, searching for one specific encoded character at a time.
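For illustration, that approach would look roughly like this, with a hypothetical entity_map covering only a few of the entities:

from pyspark.sql.functions import regexp_replace

# Hypothetical subset; the real list has 200+ named entities.
entity_map = {"&lt;": "<", "&gt;": ">", "&amp;": "&"}

col = df["value"]
for entity, char in entity_map.items():
    # Each call scans the whole string again for one entity.
    col = regexp_replace(col, entity, char)

decoded = df.withColumn("value", col)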

What is a good approach?

Answer

Since you read plain text input, I would simply skip the UDF part: do the unescaping in plain Python while the data is still on the Python side, and only then pass it to the JVM as a DataFrame. With Python 3.4+:

import html
from pyspark.sql.types import StringType, StructField, StructType

def clean(s):
    # Unescape on the Python side; the trailing comma wraps the result
    # in a 1-tuple so each record maps to a single-column row.
    return html.unescape(s),

(sc.textFile("README.md")
    .map(clean)
    .toDF(StructType([StructField("value", StringType(), False)])))
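If the data has already been loaded as a single-column DataFrame (e.g. via spark.read.text), a similar approach should work by dropping to the RDD, unescaping, and converting back; df and the value column name are assumptions here:

# df is assumed to be a single-column DataFrame of raw lines,
# e.g. df = spark.read.text(path), with the column named "value".
decoded = (df.rdd
    .map(lambda row: (html.unescape(row.value),))
    .toDF(["value"]))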