Venu A Positive - 1 year ago
Scala Question

How to find the size of a Spark RDD/DataFrame?

I know how to find the size of a file in Scala. But how do I find the size of an RDD/DataFrame in Spark?


object Main extends App {
  // Path of the input file on HDFS
  val file = "hdfs://localhost:9000/samplefile.txt"

  // Assumes sc is an existing SparkContext (as in spark-shell)
  val distFile = sc.textFile(file)
}

But when I process the file this way, I do not get its size. How do I find the size of the RDD?

Answer

Yes, I finally found a solution. Include these imports:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd
import org.apache.spark.util.SizeEstimator

How to find the RDD size:

def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong) // size of each record in UTF-8 bytes
     .reduce(_ + _) // add the sizes together
}
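To see what calcRDDSize measures, the per-record logic can be tried on a plain Scala collection, without a cluster. This is only a minimal sketch: `Utf8SizeSketch` and `utf8Size` are hypothetical names, not part of Spark. The sum counts the UTF-8 bytes of the record contents only; `sc.textFile` strips line terminators, so the result is usually somewhat smaller than the file size on disk.

```scala
object Utf8SizeSketch {
  // Hypothetical helper: same per-record logic as calcRDDSize,
  // applied to a local Seq instead of an RDD.
  def utf8Size(lines: Seq[String]): Long =
    lines.map(_.getBytes("UTF-8").length.toLong).sum

  def main(args: Array[String]): Unit = {
    // "\u00e9" is a single character that needs two bytes in UTF-8
    val lines = Seq("abc", "h\u00e9llo")

    println(utf8Size(lines))          // size in UTF-8 bytes, no line terminators
    println(lines.map(_.length).sum)  // character count, for comparison
  }
}
```

The two numbers differ as soon as the data contains non-ASCII characters, which is why the function encodes each record to bytes instead of using the string length.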

How to find the DataFrame size (this simply converts the DataFrame to an RDD internally):

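// toDF() needs the SQL implicits in scope, e.g. import sqlContext.implicits._
val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path

// Convert each Row to a String so that calcRDDSize can be reused
val rddOfDataframe = dataFrame.rdd.map(_.toString)

val size = calcRDDSize(rddOfDataframe)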
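
Put together, the approach above might look like the following self-contained program. It is a sketch under some assumptions, not the only way to do it: it assumes Spark 2.x or later (for `SparkSession`), runs with a local master, and uses a tiny in-memory DataFrame in place of a real input path. `DataFrameSizeSketch` and `dataFrameSize` are hypothetical names. Since `Row.toString` adds brackets and separators around the values, the result is an approximation of the data size, not an exact byte count.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameSizeSketch {
  // Same idea as calcRDDSize, but fold(0L) also works on an empty RDD,
  // where reduce would throw an exception.
  def calcRDDSize(rdd: RDD[String]): Long =
    rdd.map(_.getBytes("UTF-8").length.toLong).fold(0L)(_ + _)

  // Hypothetical helper: total UTF-8 size of the string form of every row.
  def dataFrameSize(df: DataFrame): Long =
    calcRDDSize(df.rdd.map(_.toString))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameSizeSketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._ // needed for toDF()

    // Small in-memory stand-in for a DataFrame read from a real path
    val df = Seq("ab", "cde").toDF("value")
    println(dataFrameSize(df))

    spark.stop()
  }
}
```

Using `fold` with a zero value instead of `reduce` is a small design choice: it returns 0 for an empty input instead of failing.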