Tiffany Hatsune - 2 years ago
Scala Question

Applying several Indexers in Spark Mllib

Here's my code:

val workindexer = new StringIndexer().setInputCol("workclass").setOutputCol("workclassIndex")
val workencoder = new OneHotEncoder().setInputCol("workclassIndex").setOutputCol("workclassVec")

val educationindexer = new StringIndexer().setInputCol("education").setOutputCol("educationIndex")
val educationencoder = new OneHotEncoder().setInputCol("educationIndex").setOutputCol("educationVec")

val maritalindexer = new StringIndexer().setInputCol("marital_status").setOutputCol("maritalIndex")
val maritalencoder = new OneHotEncoder().setInputCol("maritalIndex").setOutputCol("maritalVec")

val occupationindexer = new StringIndexer().setInputCol("occupation").setOutputCol("occupationIndex")
val occupationencoder = new OneHotEncoder().setInputCol("occupationIndex").setOutputCol("occupationVec")

val relationindexer = new StringIndexer().setInputCol("relationship").setOutputCol("relationshipIndex")
val relationencoder = new OneHotEncoder().setInputCol("relationshipIndex").setOutputCol("relationshipVec")

val raceindexer = new StringIndexer().setInputCol("race").setOutputCol("raceIndex")
val raceencoder = new OneHotEncoder().setInputCol("raceIndex").setOutputCol("raceVec")

val sexindexer = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
val sexencoder = new OneHotEncoder().setInputCol("sexIndex").setOutputCol("sexVec")

val nativeindexer = new StringIndexer().setInputCol("native_country").setOutputCol("native_countryIndex")
val nativeencoder = new OneHotEncoder().setInputCol("native_countryIndex").setOutputCol("native_countryVec")

val labelindexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndex")


Is there any way to apply all these encoders and indexers without creating countless intermediate DataFrames?

Answer Source

I'd use RFormula:

import org.apache.spark.ml.feature.RFormula

val features = Seq("workclass", "education", 
   "marital_status", "occupation", "relationship", 
   "race", "sex", "native_country")

val formula = new RFormula().setFormula(s"label ~ ${features.mkString(" + ")}")

RFormula string-indexes and one-hot encodes the string columns named in the formula, much like the indexers and encoders in the question, and assembles the results into a single features vector column.
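
To actually get the transformed data, the formula has to be fitted and then applied, which takes one `fit` and one `transform` call. Here is a minimal sketch of that, assuming Spark 2.x or later (for `SparkSession`). The object name `RFormulaSketch`, the helper names `buildFormula` and `prepare`, and the three toy rows are made up for illustration and are not from the question. The output label column is set to `labelIndex` only to mirror the naming in the question.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.{DataFrame, SparkSession}

object RFormulaSketch {
  // Hypothetical helper: builds "label ~ a + b + ..." from column names.
  def buildFormula(label: String, features: Seq[String]): RFormula =
    new RFormula()
      .setFormula(s"$label ~ ${features.mkString(" + ")}")
      .setFeaturesCol("features")   // assembled feature vector
      .setLabelCol("labelIndex")    // output label column, named as in the question

  // Fit the formula on df and apply it to the same df in one pass.
  def prepare(df: DataFrame, features: Seq[String]): DataFrame =
    buildFormula("label", features).fit(df).transform(df)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rformula-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy rows (assumption): only two of the question's columns, to keep it short.
    val df = Seq(
      ("Private", "Bachelors", ">50K"),
      ("State-gov", "HS-grad", "<=50K"),
      ("Private", "HS-grad", "<=50K")
    ).toDF("workclass", "education", "label")

    prepare(df, Seq("workclass", "education"))
      .select("features", "labelIndex")
      .show(false)

    spark.stop()
  }
}
```

The fitted model returned by `fit` can be kept and reused on a test set, so the category-to-index mapping learned on the training data stays the same.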
