Venu A Positive - 3 months ago
Scala Question

How to find Spark RDD/DataFrame size?

I know how to find the file size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  // java.io.File only works for local paths; it cannot read an hdfs:// URI
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt")
  println(file.length) // file size in bytes (0 if the file is missing or unreadable)
}
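
Note: since java.io.File cannot read an hdfs:// URI, the size of a file on HDFS comes from the Hadoop FileSystem API instead. A minimal sketch, assuming the Hadoop client libraries are on the classpath:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("hdfs://localhost:9000/samplefile.txt")
val fs = path.getFileSystem(new Configuration())
println(fs.getFileStatus(path).getLen) // file size in bytes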


Spark:


val distFile = sc.textFile("hdfs://localhost:9000/samplefile.txt")
println(distFile.length) // does not compile: RDD has no length member


But when I process the file this way, I can't get its size. How do I find the size of an RDD?

Answer

Finally, I found the solution. Include these imports:

import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator

How to find the RDD Size:

def calcRDDSize(rdd: RDD[String]): Long = {
  // sum the UTF-8 byte length of every line; this measures the raw data
  // size, not the JVM object overhead
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .reduce(_ + _) // add the per-line sizes together
}
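
A quick usage sketch (assuming an existing SparkContext named sc; the path is just a placeholder):

val lines = sc.textFile("hdfs://localhost:9000/samplefile.txt")
println(s"RDD size in bytes: ${calcRDDSize(lines)}")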

Function to find the DataFrame size (it simply converts the DataFrame to an RDD of strings internally):

// toDF() needs the implicit conversions in scope, e.g. import spark.implicits._
val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path

// the string rendering of each row approximates its data size
val rddOfDataframe = dataFrame.rdd.map(_.toString())

val size = calcRDDSize(rddOfDataframe)
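
The SizeEstimator import above points to another option: estimating the in-memory (JVM) footprint rather than the raw data size. SizeEstimator.estimate only measures a local object on the driver, so applying it to the RDD handle itself does not measure the distributed data; one sketch is to bring a small sample to the driver first:

val sample = rddOfDataframe.take(1000) // small sample collected to the driver
println(s"Estimated JVM size of the sample: ${SizeEstimator.estimate(sample)} bytes")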