mastersheel007 - 3 months ago
Java Question

Data manipulation on all columns in Dataset with Java API

After reading a CSV file into a Dataset, I want to remove spaces from the String-type data using the Java API.

Apache Spark 2.0.0

Dataset<Row> dataset = sparkSession.read().format("csv").option("header", "true").load("/pathToCsv/data.csv");
Dataset<String> dataset2 = dataset.map(new MapFunction<Row, String>() {
    @Override
    public String call(Row value) throws Exception {
        // But this removes spaces from only the first column
        return value.getString(0).replace(" ", "");
    }
}, Encoders.STRING());


Using MapFunction, I am not able to remove spaces from all columns.

But in Scala, the following in spark-shell performs the desired operation:

val ds = spark.read.format("csv").option("header", "true").load("/pathToCsv/data.csv")
val opds = ds.select(ds.columns.map(c => regexp_replace(col(c), " ", "").alias(c)): _*)


The Dataset opds contains the data without spaces. I want to achieve the same in Java, but in the Java API the columns method returns String[], and I am not able to apply functional programming to the Dataset.

INPUT DATA

+----------------+----------+-----+---+---+
|               x|         y|    z|  a|  b|
+----------------+----------+-----+---+---+
|     Hello World|John Smith|There|  1|2.3|
|Welcome to world| Bob Alice|Where|  5|3.6|
+----------------+----------+-----+---+---+


EXPECTED OUTPUT DATA

+--------------+---------+-----+---+---+
|             x|        y|    z|  a|  b|
+--------------+---------+-----+---+---+
|    HelloWorld|JohnSmith|There|  1|2.3|
|Welcometoworld| BobAlice|Where|  5|3.6|
+--------------+---------+-----+---+---+

Answer

Try:

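// Requires: import static org.apache.spark.sql.functions.regexp_replace;
// columns() is a method in the Java API, so it needs parentheses.
for (String col : dataset.columns()) {
  dataset = dataset.withColumn(col, regexp_replace(dataset.col(col), " ", ""));
}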
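If you would rather stay close to the Scala one-liner from the question, a String[] can still be processed functionally in Java 8 through Arrays.stream. The sketch below only builds one SQL expression string per column. It does not need Spark to run, and the class and method names (StripSpacesExprs, stripSpacesExprs) are made up for illustration. It assumes the column names contain no backticks and that every column is a string, as it is when the CSV is read without schema inference.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark.
class StripSpacesExprs {

    // Builds one Spark SQL expression per column name that removes spaces
    // and keeps the original column name as the alias.
    static String[] stripSpacesExprs(String[] columns) {
        return Arrays.stream(columns)
                .map(c -> "regexp_replace(`" + c + "`, ' ', '') AS `" + c + "`")
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Column names from the sample data; with Spark this would be dataset.columns().
        String[] exprs = stripSpacesExprs(new String[] {"x", "y", "z", "a", "b"});
        for (String e : exprs) {
            System.out.println(e);
        }
        // With Spark on the classpath (not exercised here):
        // Dataset<Row> opds = dataset.selectExpr(exprs);
    }
}
```

The resulting array can be passed to dataset.selectExpr(...), which accepts String varargs, so all columns are rewritten in a single projection instead of one withColumn call per column.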