Satya Satya - 5 months ago 144
Python Question

pyspark,spark: how to select last row and also how to access pyspark dataframe by index

from a pyspark sql dataframe like

name age city
abc 20 A
def 30 B

How to get the last row.(Like by df.limit(1) I can get first row of dataframe into new dataframe).

And how can I access the dataframe rows by row no. 12 or 200 .

In pandas I can do

df.tail(1) # for last row
df.ix[rowno or index] # by index
df.loc[] or by df.iloc[]

I am just curious how to access pyspark dataframe in such ways or alternative ways.



How to get the last row.

Long and ugly way which assumes that all columns are oderable:

from pyspark.sql.functions import (
    col, max as max_, struct, monotonically_increasing_id

last_row = (df
    .withColumn("_id", monotonically_increasing_id())
    .select(max(struct("_id", *df.columns))

If not all columns can be order you can try:

with_id = df.withColumn("_id", monotonically_increasing_id())
i ="_id")).first()[0]

with_id.where(col("_id") == i).drop("_id")

Note. There is last function in pyspark.sql.functions/ `o.a.s.sql.functions but considering description of the corresponding expressions it is not a good choice here.

how can I access the dataframe rows by

You cannot. Spark DataFrame and accessible by index. You can add indices using zipWithIndex and filter later. Just keep in mind this O(N) operation.