Kratos Kratos - 14 days ago 5
Python Question

Is there an equivalent of NumPy's tile in Spark?

I have a NumPy array in Python that I wanted to duplicate, so I used

tile(array(x), (2, 1))


This, given an array
[1,2,3]
will return
[[1,2,3],[1,2,3]]


But in PySpark I have a PipelinedRDD instead.
Is there a corresponding function for this purpose?
I have not been able to find one.

Thank you

Answer

There is no equivalent:

  • An RDD is a distributed collection of local objects.
  • An RDD cannot contain another RDD.
  • Local objects are limited by the available memory, so they are not suited to holding the contents of a complete RDD.

You can repeat an RDD in one dimension using:

sc.union([rdd for _ in range(n)])

which is equivalent to

np.tile(a, n)

where a is a one-dimensional array and n is a scalar.
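As a minimal sketch, the union approach can be wrapped in a small helper. The name repeat_rdd and its argument check are my own additions for illustration, not part of the Spark API, and sc is assumed to be an active SparkContext:

```python
def repeat_rdd(sc, rdd, n):
    """Return an RDD holding the elements of `rdd` repeated `n` times.

    Hypothetical helper: `sc` is assumed to be an active
    pyspark.SparkContext and `rdd` an RDD created from it.
    """
    if n < 1:
        # An empty list of RDDs gives union nothing to combine,
        # so reject it up front.
        raise ValueError("n must be a positive integer")
    # Build a list with n references to the same RDD and union them.
    return sc.union([rdd for _ in range(n)])
```

With a live context, repeat_rdd(sc, sc.parallelize([1, 2, 3]), 2).collect() should return a flat list in which every element appears twice. It does not produce the nested [[1,2,3],[1,2,3]] shape from the question, for the reasons listed above.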
