
Take n rows from a Spark DataFrame and pass to toPandas()

I have this code:

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).toPandas()


Works fine, and does what it needs to. Suppose, though, I only want to display the first n rows, and then call toPandas() to return a pandas DataFrame. How do I do it? I can't call take(n) because that doesn't return a DataFrame, and so I can't pass it to toPandas().
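
For example, take(2) just gives back a plain Python list of Row objects, so there is nothing to call toPandas() on:

rows = df.take(2)   # a plain Python list of Row objects, not a DataFrame
# rows has no toPandas() method, hence the question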

So to put it another way: how can I take the top n rows from a DataFrame and call toPandas() on the resulting DataFrame? I can't imagine this is difficult, but I can't figure it out.

I'm using Spark 1.6.0.

Answer

You can use the limit(n) function. Unlike take(n), limit(n) returns a new DataFrame with at most n rows, so you can chain it with toPandas():

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
# limit first, then add the derived column
df.limit(2).withColumn('age2', df.age + 2).toPandas()

or:

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
# add the derived column first, then limit
df.withColumn('age2', df.age + 2).limit(2).toPandas()
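
Either way, only the two limited rows are pulled back to the driver when toPandas() runs, rather than the whole DataFrame. As a side note (a sketch not taken from the original answer), if you do want to stay with take(n), you can build the pandas DataFrame yourself, since Row objects behave like tuples; df2 and pdf below are just illustrative names:

import pandas as pd

# roughly the same result as the limit(2) version above, built by hand from take(2)
df2 = df.withColumn('age2', df.age + 2)
pdf = pd.DataFrame(df2.take(2), columns=df2.columns)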