kamalbanga kamalbanga - 1 year ago 298
Python Question

How to return a "Tuple type" in a UDF in PySpark?

All the data types in


__all__ = [
"DataType", "NullType", "StringType", "BinaryType", "BooleanType", "DateType",
"TimestampType", "DecimalType", "DoubleType", "FloatType", "ByteType", "IntegerType",
"LongType", "ShortType", "ArrayType", "MapType", "StructField", "StructType"]

I have to write a UDF (in pyspark) which returns an array of tuples. What do I give the second argument to it which is the return type of the udf method? It would be something on the lines of

Answer Source

There is no such thing as a TupleType in Spark. Product types are represented as structs with fields of specific type. For example if you want to return an array of pairs (integer, string) you can use schema like this:

from pyspark.sql.types import *

schema = ArrayType(StructType([
    StructField("char", StringType(), False),
    StructField("count", IntegerType(), False)

Example usage:

from pyspark.sql.functions import udf
from collections import Counter

char_count_udf = udf(
    lambda s: Counter(s).most_common(),

df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["id", "value"])

df.select("*", char_count_udf(df["value"])).show(2, False)

## +---+-----+-------------------------+
## |id |value|PythonUDF#<lambda>(value)|
## +---+-----+-------------------------+
## |1  |foo  |[[o,2], [f,1]]           |
## |2  |bar  |[[r,1], [a,1], [b,1]]    |
## +---+-----+-------------------------+
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download