gsamaras gsamaras - 1 year ago 155
Python Question

Towards limiting the big RDD

I am reading many images and I would like to work on a tiny subset of them for developing. As a result I am trying to understand how and could make that happen:

In [1]: d ='foo')
In [2]: x: x.photo_id).first()
Out[2]: u'28605'

In [3]: d.limit(1).map(lambda x: x.photo_id)
Out[3]: PythonRDD[31] at RDD at PythonRDD.scala:43

In [4]: d.limit(1).map(lambda x: x.photo_id).first()
// still running... what is happening? I would expect the limit() to run much faster than what we had in
, but that's not the case*.

Below I will describe my understanding, and please correct me, since obviously I am missing something:

  1. d
    is an RDD of pairs (I know that from the schema) and I am saying
    with the map function:

    i) Take every pair (which will be named
    and give me back the

    ii) That will result in a new (anonymous) RDD, in which we are applying the
    method, which I am not sure how it works$, but should give me the first element of that anonymous RDD.

  2. In
    , we limit the
    RDD to 1, which means that despite
    many elements, use only 1 and apply the map function to that one
    element only. The
    Out [3]
    should be the RDD created by the mapping.

  3. In
    , I would expect to follow the logic of
    and just print the one and only element of the limited RDD...

As expected, after looking at the monitor, [4] seems to process the whole dataset, while the others aren't, so it seems that I am not using
correctly, or that that's not what am I looking for:

enter image description here


tiny_d = d.limit(1).map(lambda x: x.photo_id) x: x.photo_id).first()

The first will give a
, which as described here, it will not actually do any action, just a transformation.

However, the second line will also process the whole dataset (as a matter of fact, the number of Tasks now are as many as before, plus one!).

*[2] executed instantly, while [4] is still running and >3h have passed..

$I couldn't find it in the documentation, because of the name.

Answer Source

Based on your code, here is simpler test case on Spark 2.0

case class my (x: Int)
val rdd = sc.parallelize(0.until(10000), 1000).map { x => my(x) }
val df1 = spark.createDataFrame(rdd)
val df2 = df1.limit(1) { r => r.getAs[Int](0) }.first { r => r.getAs[Int](0) }.first // Much slower than the previous line

Actually, Dataset.first is equivalent to Dataset.limit(1).collect, so check the physical plan of the two cases:

scala> { r => r.getAs[Int](0) }.limit(1).explain
== Physical Plan ==
CollectLimit 1
+- *SerializeFromObject [input[0, int, true] AS value#124]
   +- *MapElements <function1>, obj#123: int
      +- *DeserializeToObject createexternalrow(x#74, StructField(x,IntegerType,false)), obj#122: org.apache.spark.sql.Row
         +- Scan ExistingRDD[x#74]

scala> { r => r.getAs[Int](0) }.limit(1).explain
== Physical Plan ==
CollectLimit 1
+- *SerializeFromObject [input[0, int, true] AS value#131]
   +- *MapElements <function1>, obj#130: int
      +- *DeserializeToObject createexternalrow(x#74, StructField(x,IntegerType,false)), obj#129: org.apache.spark.sql.Row
         +- *GlobalLimit 1
            +- Exchange SinglePartition
               +- *LocalLimit 1
                  +- Scan ExistingRDD[x#74]

For the first case, it is related to an optimisation in the CollectLimitExec physical operator. That is, it will first fetch the first partition to get limit number of row, 1 in this case, if not satisfied, then fetch more partitions, until the desired limit is reached. So generally, if the first partition is not empty, only the first partition will be calculated and fetched. Other partitions will even not be computed.

However, in the second case, the optimisation in the CollectLimitExec does not help, because the previous limit operation involves a shuffle operation. All partitions will be computed, and running LocalLimit(1) on each partition to get 1 row, and then all partitions are shuffled into a single partition. CollectLimitExec will fetch 1 row from the resulted single partition.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download