lilyrobin - 5 months ago
Python Question

How can I add a value to a row in pyspark?

I have a dataframe that looks like this:

preds.take(1)
[Row(_1=0, _2=Row(val1=False, val2=1, val3='high_school'))]


I want the whole thing to be one row, without the nested row in there. So, the first value would get a name and be a part of the one row object. If I wanted to name it "ID", it would look like this:

preds.take(1)
[Row(ID=0, val1=False, val2=1, val3='high_school')]


I've tried various things within a map, but nothing produces what I'm looking for (or I get errors). I've tried:

preds.map(lambda point: (point._1, point._2))
preds.map(lambda point: point._2.append(point._1))
preds.map(lambda point: point._2['ID']=point._1)
preds.map(lambda point: (point._2).ID=point._1)

Answer

Since Row is a tuple and tuples are immutable, you can only create a new object. Using plain tuples:

from pyspark.sql import Row

r = Row(_1=0, _2=Row(val1=False, val2=1, val3='high_school'))
r[:1] + r[1]
## (0, False, 1, 'high_school')

or preserving __fields__:

Row(*r.__fields__[:1] + r[1].__fields__)(*r[:1] + r[1])
## Row(_1=0, val1=False, val2=1, val3='high_school') 

In practice, operating directly on rows should be avoided in favor of the DataFrame DSL, which keeps the data out of the Python interpreter:

df = sc.parallelize([r]).toDF()

df.select("_1", "_2.val1", "_2.val2", "_2.val3")