rye rye - 1 year ago 53
Python Question

How do I order fields of my Row objects in Spark (Python)?

I'm creating Row objects in Spark. I do not want my fields to be ordered alphabetically. However, if I do the following they are ordered alphabetically.

row = Row(foo=1, bar=2)

Then it creates an object like the following:

Row(bar=2, foo=1)

When I then create a DataFrame from this object, the column order is bar first, foo second, when I'd prefer to have it the other way around.

I know I can use "_1" and "_2" (for "foo" and "bar", respectively) and then assign a schema (with appropriate "foo" and "bar" names). But is there any way to prevent the Row object from ordering them?

Answer Source

But is there any way to prevent the Row object from ordering them?

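There isn't. If you pass keyword arguments, they are sorted by name, and there is no workaround. (This describes Spark before 3.0; from 3.0 on, with Python 3.6+, Row keeps keyword arguments in the order given.) Just use plain tuples: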

rdd = sc.parallelize([(1, 2)])

and pass the schema as an argument to toDF:

rdd.toDF(["foo", "bar"])

or createDataFrame:

from pyspark.sql.types import StructType, StructField, IntegerType

sqlContext.createDataFrame(rdd, ["foo", "bar"])

# With full schema
schema = StructType([
    StructField("foo", IntegerType(), False),
    StructField("bar", IntegerType(), False)])
sqlContext.createDataFrame(rdd, schema)
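
If the records start out as dicts rather than tuples, a small helper can put the values into tuples in the column order you want before they go to parallelize. This is only a sketch: `to_ordered_tuples`, `records`, and `columns` are hypothetical names used for illustration, not part of Spark.

```python
def to_ordered_tuples(records, columns):
    """Turn dict records into tuples whose values follow the order of `columns`."""
    return [tuple(record[name] for name in columns) for record in records]


# Hypothetical sample data; key order inside each dict does not matter.
columns = ["foo", "bar"]
records = [{"foo": 1, "bar": 2}, {"bar": 4, "foo": 3}]

rows = to_ordered_tuples(records, columns)
```

The resulting list of tuples can then be handed to sc.parallelize(rows) and named with toDF(columns), so the same column list drives both the value order and the column names.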

Finally, you can reorder the columns with select:

sc.parallelize([Row(foo=1, bar=2)]).toDF().select("foo", "bar")