
Transform a string column into a vector column in Spark DataFrames

I have a Spark dataframe that looks as follows:

+-----------+-------------------+
| ID | features |
+-----------+-------------------+
| 18156431|(5,[0,1,4],[1,1,1])|
| 20260831|(5,[0,4,5],[2,1,1])|
| 91859831|(5,[0,1],[1,3]) |
| 206186631|(5,[3,4,5],[1,5]) |
| 223134831|(5,[2,3,5],[1,1,1])|
+-----------+-------------------+


In this dataframe the features column is a sparse vector. In my scripts I have to save this DF as a file on disk. When doing this, the features column is saved as a text column, for example "(5,[0,1,4],[1,1,1])". When importing it back into Spark the column stays a string, as you would expect. How can I convert the column back to the (sparse) vector format?

Answer

This is not particularly efficient due to the UDF overhead (it would be better to use a format that preserves types; see the note after the snippet), but you can do something like this:

from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

df = sc.parallelize([
    (18156431, "(5,[0,1,4],[1,1,1])")
]).toDF(["id", "features"])

# Vectors.parse understands the "(size,[indices],[values])" string format
# and returns a SparseVector; VectorUDT is the matching SQL column type.
parse = udf(lambda s: Vectors.parse(s), VectorUDT())
df.select(parse("features"))
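
As noted above, a format that preserves Spark SQL types avoids the string round trip entirely. A minimal sketch using Parquet (assuming df_with_vectors is a DataFrame that already holds a proper vector column and that a SparkSession named spark is available; the path is a placeholder):

# Parquet stores the VectorUDT column type, so the file reads back
# as a vector column with no parsing step.
df_with_vectors.write.parquet("features.parquet")
df_back = spark.read.parquet("features.parquet")
df_back.printSchema()  # features is still a vector column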

Please note this doesn't port directly to Spark 2.0.0+ and the ML Vector type. Since ML vectors don't provide a parse method, you'd have to parse to an MLlib vector first and convert it with asML, as shown below.
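
A minimal sketch of that conversion (assuming Spark 2.0+ and the same df as above; MLLibVectors and MLVectorUDT are just local aliases):

from pyspark.mllib.linalg import Vectors as MLLibVectors
from pyspark.ml.linalg import VectorUDT as MLVectorUDT
from pyspark.sql.functions import udf

# Parse with the MLlib parser, then convert to the ML type with asML().
parse_ml = udf(lambda s: MLLibVectors.parse(s).asML(), MLVectorUDT())
df.select(parse_ml("features"))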
