HalfPintBoy - 6 months ago
Python Question

pyspark matrix with dummy variables

I have a DataFrame with two columns:

ID Text
1 a
2 b
3 c


How can I create a matrix of dummy variables like this:

ID a b c
1 1 0 0
2 0 1 0
3 0 0 1


I would like to do this using the PySpark library and its built-in features.

Answer
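from pyspark.sql import SparkSession, functions as F

# Assumes Spark 2.0+; on Spark 1.x use the shell's sqlContext instead.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1, "a"),
    (2, "b"),
    (3, "c"),
], ["ID", "Text"])

# Collect the distinct category values to the driver (sorted for a stable column order).
categories = sorted(df.select("Text").distinct().rdd.flatMap(lambda x: x).collect())

# Build one 0/1 indicator column per category.
exprs = [F.when(F.col("Text") == category, 1).otherwise(0).alias(category)
         for category in categories]

df.select("ID", *exprs).show()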

Output

+---+---+---+---+
| ID|  a|  b|  c|
+---+---+---+---+
|  1|  1|  0|  0|
|  2|  0|  1|  0|
|  3|  0|  0|  1|
+---+---+---+---+
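
To make the logic of the answer easier to follow, here is a rough plain-Python model of the same two steps: gather the distinct values of Text, then emit one 0/1 indicator per value for every row. It does not use Spark at all; the function name dummy_matrix and the list-of-tuples input are my own assumptions for illustration, not part of the original answer or of any PySpark API.

```python
def dummy_matrix(rows):
    """Plain-Python model of the Spark answer; rows is a list of (id, text) pairs."""
    # Step 1: distinct category values, sorted so the column order is stable.
    categories = sorted({text for _, text in rows})
    # Step 2: one 0/1 indicator per category, mirroring when(...).otherwise(0).
    header = ["ID"] + categories
    body = [[row_id] + [1 if text == category else 0 for category in categories]
            for row_id, text in rows]
    return header, body


header, body = dummy_matrix([(1, "a"), (2, "b"), (3, "c")])
print(header)
for line in body:
    print(line)
```

Something like this can be handy for checking expectations on a small sample before running the Spark version. Note that the Spark answer also collects the distinct values to the driver, so it is probably best suited to columns with a modest number of categories.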