Kardu Kardu - 1 year ago 54
Python Question

pyspark add new column field with the data frame row number

Hy, I'm trying build a recommendation system with Spark

I have a data frame with users email and movie rating.

df = pd.DataFrame(np.array([["aa@gmail.com",2,3],["aa@gmail.com",5,5],["bb@gmail.com",8,2],["cc@gmail.com",9,3]]), columns=['user','movie','rating'])

sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

user movie rating
aa@gmail.com 2 3
aa@gmail.com 5 5
bb@gmail.com 8 2
cc@gmail.com 9 3

My first doubt it is, pySpark MLlib doesn't accept emails I'm correct? Because this I need to change the email by a Primary key.

My approach was create a temporary table, select distinct user and now I want add a new column with a row number (and this number will be the primary key for each user.


DistinctUsers = sqlContext.sql("Select distinct user FROM sparkdf")

What I have

| user|

What I want

| user| PK
|bb@gmail.com| 1
|aa@gmail.com| 2
|cc@gmail.com| 3

Next I will do a join and obtain my final data frame to use in MLlib

user movie rating
1 2 3
1 5 5
2 8 2
3 9 3

and thanks for your time.


Primary keys with Apache Spark practically answers your question but in this particular case using StringIndexer could be a better choice:

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="user", outputCol="user_id")
indexed = indexer.fit(sparkdf ).transform(sparkdf)