
pyspark, spark: how to select the last row, and how to access a pyspark dataframe by index

From a PySpark SQL dataframe like

name  age  city
abc   20   A
def   30   B

How do I get the last row? (With df.limit(1) I can get the first row of the dataframe into a new dataframe.)

And how can I access the dataframe rows by row number, e.g. row 12 or 200?

In pandas I can do

df.tail(1)             # for the last row
df.ix[rowno or index]  # by index
df.loc[] or df.iloc[]

I am just curious how to access a pyspark dataframe in such ways, or in alternative ways.


Answer Source

How to get the last row.

A long and ugly way, which assumes that all columns are orderable:

from pyspark.sql.functions import (
    col, max as max_, struct, monotonically_increasing_id
)

last_row = (df
    .withColumn("_id", monotonically_increasing_id())
    # max over a struct compares fields left to right, so the row
    # with the largest _id wins
    .select(max_(struct("_id", *df.columns)).alias("tmp"))
    .select(col("tmp.*"))
    .drop("_id"))

If not all columns can be ordered, you can try:

with_id = df.withColumn("_id", monotonically_increasing_id())
i ="_id")).first()[0]  # largest _id marks the last row

with_id.where(col("_id") == i).drop("_id")
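Keep in mind that monotonically_increasing_id only guarantees increasing and unique ids, not consecutive ones, so the generated ids cannot serve as row numbers; all that matters here is that the maximum id belongs to the last row.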

Note: there is a last function in pyspark.sql.functions / o.a.s.sql.functions, but considering the description of the corresponding expressions, it is not a good choice here.

how can I access the dataframe rows by row no.?

You cannot. A Spark DataFrame is not accessible by index. You can add indices using zipWithIndex and filter afterwards. Just keep in mind that it is an O(N) operation.
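A minimal sketch of that approach, assuming a hypothetical helper named row_at (not part of any Spark API) and 0-based row numbers:

def row_at(df, n):
    # zipWithIndex pairs each Row with its position; the filter then
    # scans the whole dataset, which is why this is O(N)
    return (df.rdd
            .zipWithIndex()                     # RDD of (Row, index) pairs
            .filter(lambda pair: pair[1] == n)  # keep the requested position
            .map(lambda pair: pair[0])          # drop the index again
            .collect())

row_at(df, 12)  # hypothetical usage: row at position 12, as a list of Row objects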