
How to decode HTML entities in Spark?

I'm reading a large collection of text files into a DataFrame. Initially it will just have one column, value. The text files use HTML encoding (i.e., they have &amp; instead of &, etc.). I want to decode all of them back to normal characters.

Obviously, I could do it with a UDF, but it would be super slow.
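
For reference, the UDF version I have in mind is roughly the following sketch (the DataFrame name df and the column name value are just for illustration):

import html
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Works, but every row has to round-trip through Python serialization
unescape_udf = udf(html.unescape, StringType())
decoded = df.withColumn("value", unescape_udf("value"))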

I could try regexp_replace, but it would be even slower, since there are over 200 named entities and each would require its own regexp_replace call. Every call would have to rescan the entire line of text, searching for one specific encoded entity at a time.
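
To make the problem concrete, the chained version would look something like the sketch below, repeated for every one of the 200+ entities (only a few shown):

from pyspark.sql.functions import regexp_replace

# The full mapping has 200+ entries, and each regexp_replace pass
# rescans the whole column
entities = {"&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"'}

decoded = df
for pattern, replacement in entities.items():
    decoded = decoded.withColumn("value", regexp_replace("value", pattern, replacement))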

What is a good approach?


Since you read plain text input, I would simply skip the UDF part and pass the data to the JVM only after the initial processing. With Python 3.4+:

import html
from pyspark.sql.types import StringType, StructField, StructType

def clean(s):
    # The trailing comma wraps each line in a one-element tuple,
    # so toDF can turn it into a single-column Row
    return html.unescape(s),

(sc.textFile(path)  # path points at the input text files
    .map(clean)
    .toDF(StructType([StructField("value", StringType(), False)])))
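
If the data already sits in a DataFrame (say, from spark.read.text), the same function can be applied by dropping to the underlying RDD and back. A quick sketch, assuming spark is your SparkSession and path is the input location:

import html

# html.unescape covers every named entity as well as numeric character
# references, e.g. "Fish &amp; Chips &#169;" -> "Fish & Chips ©"
df = spark.read.text(path)
decoded = df.rdd.map(lambda row: (html.unescape(row.value),)).toDF(["value"])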