Kratos - 10 months ago
Python Question

Is there an equivalent of Python's tile in Spark?

I have a NumPy array in Python that I wanted to duplicate, so I used

tile(array(x), (2, 1))

Given an array, this returns a new array in which the input is repeated twice along the first axis.

But in PySpark I have a PipelinedRDD instead.
Is there a corresponding function for this purpose?
I have not been able to find one.

Thank you

Answer Source

There is no direct equivalent, because:

  • An RDD is a distributed collection of local objects.
  • An RDD cannot contain another RDD.
  • Local objects are limited by the memory of a single machine, so they are not suitable for storing the contents of a complete RDD.

You can repeat an RDD in one dimension using:

sc.union([rdd for _ in range(n)])

which is equivalent to

np.tile(a, n)

where n is a scalar.
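
As a minimal sketch, the one-dimensional repetition could be wrapped in a small helper. The name repeat_rdd and the check on n are my own additions for illustration, not part of the Spark API. The helper assumes sc is an existing SparkContext and rdd is an RDD created from it.

```python
def repeat_rdd(sc, rdd, n):
    """Return an RDD holding the elements of `rdd` repeated `n` times.

    `repeat_rdd` is a hypothetical helper, not a Spark function.
    `sc` is assumed to be a live SparkContext and `rdd` an RDD built from it.
    """
    # This sketch rejects n < 1 rather than passing an empty list to union.
    if n < 1:
        raise ValueError("n must be at least 1")
    # Union the same RDD with itself n times. Nothing is collected locally.
    return sc.union([rdd for _ in range(n)])
```

For example, with rdd = sc.parallelize([1, 2, 3]), calling repeat_rdd(sc, rdd, 2).collect() should give the same elements as np.tile([1, 2, 3], 2). If each element of the RDD is a row, the same call corresponds to np.tile(a, (2, 1)) on a two-dimensional array, which is the case asked about in the question.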