
How can I add a value to a row in pyspark?

I have a dataframe that looks like this:

[Row(_1=0, _2=Row(val1=False, val2=1, val3='high_school'))]

I want the whole thing to be one row, without the nested row inside. The first value would get a name and become part of the single Row object. If I named it "ID", it would look like this:

[Row(ID=0, val1=False, val2=1, val3='high_school')]

I've tried various things within a map, but nothing produces what I'm looking for (or I get errors). I've tried:

map(lambda point: (point._1, point._2))
map(lambda point: point._2.append(point._1))
map(lambda point: point._2['ID'] = point._1)
map(lambda point: (point._2).ID = point._1)


Since Row is a tuple, and tuples are immutable, you can only create a new object. Using plain tuples:

from pyspark.sql import Row

r = Row(_1=0, _2=Row(val1=False, val2=1, val3='high_school'))
r[:1] + r[1]
## (0, False, 1, 'high_school')

or preserving __fields__:

Row(*r.__fields__[:1] + r[1].__fields__)(*r[:1] + r[1])
## Row(_1=0, val1=False, val2=1, val3='high_school') 
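If you also want the first field renamed to ID, as in the question, one option (a sketch, not part of the original answer) is to rebuild a flat Row from the nested row's asDict():

from pyspark.sql import Row

r = Row(_1=0, _2=Row(val1=False, val2=1, val3='high_school'))

# Merge the nested fields into a new flat Row; 'ID' replaces '_1'
Row(ID=r[0], **r[1].asDict())
## Row(ID=0, val1=False, val2=1, val3='high_school')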

In practice, operating directly on rows should be avoided in favor of the DataFrame DSL, without fetching data into the Python interpreter:

df = sc.parallelize([r]).toDF()
df.select("_1", "_2.val1", "_2.val2", "_2.val3")
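To also get the ID name the question asked for, the struct fields can be selected alongside an aliased _1. A minimal sketch, assuming the same r and an active SparkContext named sc as above:

from pyspark.sql.functions import col

df = sc.parallelize([r]).toDF()

# Flatten the struct and rename _1 to ID in a single select
df.select(col("_1").alias("ID"), "_2.val1", "_2.val2", "_2.val3").first()
## Row(ID=0, val1=False, val2=1, val3='high_school')

This keeps the work inside the JVM; rows are only materialized in Python when an action like first() is called.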