Alessandro Alessandro - 3 months ago
Scala Question

Scala iterator on pattern match

I need help iterating over this piece of Spark Scala code that works on a DataFrame. I'm new to Scala, so I apologize if my question seems trivial.

The function is very simple: given a DataFrame, it casts a column when the column name matches a pattern; otherwise it selects the column unchanged.

/* Load sources */
val df = sqlContext.sql("select id_vehicle, id_size, id_country, id_time from " + working_database + carPark);


val df2 = df.select(
    df.columns.map {
        case id_vehicle @ "id_vehicle" => df(id_vehicle).cast("Int").as(id_vehicle)
        case other => df(other)
    }: _*
)


This function, with pattern matching, works perfectly!
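For context, here is a minimal pure-Scala sketch of the same pattern-matching idiom outside Spark (the column names are illustrative): writing `binder @ "literal"` matches only that exact string and binds it to `binder`, so the name can be reused on the right-hand side, while a lone lowercase identifier like `other` matches anything.

```scala
// Minimal sketch of the `binder @ "literal"` idiom, with strings
// standing in for DataFrame columns (illustrative names only).
val columns = Seq("id_vehicle", "id_size", "id_country")

val described = columns.map {
  // Matches only the exact string "id_vehicle" and binds it to idVehicle
  case idVehicle @ "id_vehicle" => s"$idVehicle -> cast to Int"
  // Matches any other column name unchanged
  case other                    => s"$other -> unchanged"
}
// described: Seq("id_vehicle -> cast to Int", "id_size -> unchanged", "id_country -> unchanged")
```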

Now I have a question: is there any way to "iterate" this? In practice, I need a function that, given a dataframe, an Array[String] of column names (column_1, column_2, ...) and another Array[String] of types (int, double, float, ...), returns the same dataframe with the right cast in the right position.

I need help :)

Answer
//Your supplied code fits nicely into this function
def castOnce(df: DataFrame, colName: String, typeName: String): DataFrame = {
    val colsCasted = df.columns.map {
        // Backticks match against the *value* of colName; without them,
        // a lowercase identifier is a variable pattern that matches every column.
        case `colName` => df(colName).cast(typeName).as(colName)
        case other     => df(other)
    }
    df.select(colsCasted: _*)
}

def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {

    assert(colNames.length == typeNames.length, "The lengths are different")
    val colsWithTypes: Array[(String, String)] = colNames.zip(typeNames)
    // foldLeft passes the accumulator (the DataFrame) first and the element second
    colsWithTypes.foldLeft(df)((newDf, cAndType) => castOnce(newDf, cAndType._1, cAndType._2))
}

When you have a function that you just need to apply many times to the same thing a fold is often what you want. The above code zips the two arrays together to combine them into one. It then iterates through this list applying your function each time to the dataframe and then applying the next pair to the resultant dataframe etc.
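To make the fold concrete without a Spark session, here is a pure-Scala sketch of the same zip-and-fold shape, with strings standing in for the DataFrame (an assumption for illustration only):

```scala
// Strings stand in for DataFrames; each fold step "applies one cast".
val colNames  = Array("id_vehicle", "id_size")
val typeNames = Array("Int", "Double")

// zip pairs each column with its type; foldLeft threads the accumulator
// (the "DataFrame") through one step per pair.
val result = colNames.zip(typeNames).foldLeft("df") { (acc, pair) =>
  s"$acc.cast(${pair._1} -> ${pair._2})"
}
// result: "df.cast(id_vehicle -> Int).cast(id_size -> Double)"
```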

Based on your edit I filled in the function above. I don't have a compiler to hand, so I'm not 100% sure it's correct. Having written it out, I am also left questioning my original approach. I believe the version below is better, but I am leaving the previous one for reference.

def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
    assert(colNames.length == typeNames.length, "The lengths are different")
    val nameToType: Map[String, String] = colNames.zip(typeNames).toMap
    val newCols = df.columns.map { dfCol =>
        nameToType.get(dfCol).map { newType =>
            df(dfCol).cast(newType).as(dfCol)
        }.getOrElse(df(dfCol))
    }
    df.select(newCols: _*)
}

The above code creates a map from column name to the desired type. Then, for each column in the DataFrame, it looks the name up in the map. If an entry exists, we cast the column to the new type; if the column is not in the map, we keep the column from the DataFrame unchanged.

We then select these columns from the DataFrame.
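The same lookup-with-default shape can be sketched in plain Scala, with strings again standing in for columns (the names and types below are illustrative assumptions):

```scala
// Map lookup with a default: Option.map transforms a hit,
// getOrElse supplies the untouched column on a miss.
val nameToType = Map("id_vehicle" -> "Int", "id_size" -> "Double")
val allColumns = Seq("id_vehicle", "id_size", "id_country")

val newCols = allColumns.map { col =>
  nameToType.get(col).map(t => s"cast($col as $t)").getOrElse(col)
}
// newCols: Seq("cast(id_vehicle as Int)", "cast(id_size as Double)", "id_country")
```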
