Evan Zamir - 2 months ago
Python Question

How to combine n-grams into one vocabulary in Spark?

I am wondering whether there is a built-in Spark feature to combine 1-, 2-, ..., n-gram features into a single vocabulary. Setting n=2 in NGram and then invoking CountVectorizer results in a dictionary containing only 2-grams. What I really want is to combine all frequent 1-grams, 2-grams, etc. into one dictionary for my corpus.

Answer

You can train separate NGram and CountVectorizer models, one pair per n-gram order, and merge their outputs using VectorAssembler.

from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler
from pyspark.ml import Pipeline


def build_ngrams(inputCol="tokens", n=3):
    # One NGram transformer per order 1..n, all reading the same token column.
    ngrams = [
        NGram(n=i, inputCol=inputCol, outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]

    # One CountVectorizer per order, so each order gets its own vocabulary.
    vectorizers = [
        CountVectorizer(inputCol="{0}_grams".format(i),
            outputCol="{0}_counts".format(i))
        for i in range(1, n + 1)
    ]

    # Concatenate the per-order count vectors into a single feature vector.
    assembler = [VectorAssembler(
        inputCols=["{0}_counts".format(i) for i in range(1, n + 1)],
        outputCol="features"
    )]

    return Pipeline(stages=ngrams + vectorizers + assembler)

Example usage:

df = spark.createDataFrame([
  (1, ["a", "b", "c", "d"]),
  (2, ["d", "e", "d"])
], ("id", "tokens"))

build_ngrams().fit(df).transform(df) 
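The pipeline yields a single "features" vector, but each CountVectorizer still keeps its own vocabulary. If you also want the combined dictionary itself, one possible sketch is to concatenate the per-stage vocabularies in stage order. The helper name combined_vocabulary is hypothetical (not part of Spark). It assumes the fitted pipeline was built by build_ngrams, so the vectorizer stages appear in the same order as the VectorAssembler inputs, and it relies on fitted CountVectorizerModel stages exposing a vocabulary attribute.

```python
def combined_vocabulary(pipeline_model):
    """Concatenate the vocabularies of all fitted stages that have one.

    Hypothetical helper, not a Spark API. Assumes the stages are ordered
    1-gram, 2-gram, ... exactly as in the VectorAssembler inputCols, so
    that index i of the assembled vector corresponds to vocab[i].
    """
    vocab = []
    for stage in pipeline_model.stages:
        # NGram and VectorAssembler stages have no vocabulary; skip them.
        terms = getattr(stage, "vocabulary", None)
        if terms is not None:
            vocab.extend(terms)
    return vocab


# Intended usage with the pipeline above (requires a running Spark session):
#   model = build_ngrams().fit(df)
#   vocab = combined_vocabulary(model)
```

Because VectorAssembler concatenates its input columns in the order given, position i in "features" should then line up with vocab[i], as long as the ordering assumption above holds.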