gsamaras - 4 months ago 13

Python Question

I have an RDD called `codes`:

`In [76]: codes.collect()`

Out[76]:

[(u'3362336966', (6208, 5320)),

(u'7889466042', (4140, 5268))]

and I am trying to get this:

`In [76]: codes.collect()`

Out[76]:

[(u'3362336966', 6208),

(u'3362336966', 5320),

(u'7889466042', 4140),

(u'7889466042', 5268)]

How can I do this?

My failed attempt:

`In [77]: codes_in = codes.map(lambda x: (x[0], x[1][0]), (x[0], x[1][1]))`

---------------------------------------------------------------------------

NameError Traceback (most recent call last)

<ipython-input-77-e1c7925bc075> in <module>()

----> 1 codes_in = codes.map(lambda x: (x[0], x[1][0]), (x[0], x[1][1]))

NameError: name 'x' is not defined

Answer

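Your attempt fails because the lambda's body ends at the first comma, so the second tuple `(x[0], x[1][1])` is passed to `map` as a separate argument and evaluated where `x` does not exist, hence the `NameError`. I think what you want is the following: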

```
codes_in = codes.map(lambda x: [(x[0], p) for p in x[1]]).flatMap(lambda x: x)
```

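If you are on Python 2, you could unpack the tuple in the lambda's parameter list for legibility (note the extra parentheses; this syntax was removed in Python 3):

```
codes_in = codes.map(lambda (k, vs): [(k, v) for v in vs]).flatMap(lambda x: x)
```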

This way you "extract" each value associated with a key, so that every row is a record of the form `(k, v)`.
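To check the logic without a Spark cluster, here is a minimal plain-Python sketch of the same flattening. The names `expand_pairs`, `flat_map`, and `codes_local` are made up for illustration, and `flat_map` is only a local stand-in for what `flatMap` does (apply a function, then flatten one level). It is not PySpark itself.

```python
from itertools import chain


def expand_pairs(record):
    """Turn (key, (v1, v2, ...)) into [(key, v1), (key, v2), ...]."""
    key, values = record
    return [(key, value) for value in values]


def flat_map(func, records):
    """Local stand-in for flatMap: apply func to each record, then flatten one level."""
    return list(chain.from_iterable(func(record) for record in records))


# Sample data copied from the question (the u'' prefixes are dropped here).
codes_local = [('3362336966', (6208, 5320)), ('7889466042', (4140, 5268))]

codes_in_local = flat_map(expand_pairs, codes_local)
print(codes_in_local)

# Assuming `codes` is a live RDD, the same function can be passed straight to
# flatMap, which also makes the separate map step unnecessary:
#   codes_in = codes.flatMap(expand_pairs)
# For key/value RDDs, flatMapValues keeps the key and flattens only the values:
#   codes_in = codes.flatMapValues(lambda values: values)
```

Because `expand_pairs` takes the whole record and unpacks it inside the function body, it works on both Python 2 and Python 3.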