BlueChips23 - 3 days ago
Java Question

Extract only certain columns in Java Spark

I have a file with 10 columns. What's the most elegant way to extract only the first 3 columns, or specific columns?

For example, this is what my file looks like:

john,smith,84,male,kansas
john,doe,48,male,california
tim,jones,22,male,delaware


And I want to extract into this:

[john, smith, kansas]
[john, doe, california]
[tim, jones, delaware]


What I have is this, but it doesn't specifically choose the columns that I want:

JavaRDD<String> peopleRDD = sc.textFile(DATA_FILE);
peopleRDD.cache().map(lines -> Arrays.asList(lines.split(",")))
.foreach(person -> LOG.info(person));


I read the following two Stack Overflow posts, but I still can't decide how to do this.

EDIT:
I ended up doing the following:

JavaRDD<String> peopleRDD = sc.textFile(DATA_FILE);
peopleRDD.cache().map(lines -> Arrays.asList(new String[]{lines.split(",")[0],
lines.split(",")[1],
lines.split(",")[4]}))
.foreach(person -> LOG.info(person));


Not the most elegant solution, but if you have a better way, please post it here. Thanks.

DNA
Answer

EDIT: Apologies, I just realized you were asking for a Java solution, but I've used Scala. Only the 3rd of my suggestions has an equivalent in Java (added at the bottom of the answer)... Spark is really much nicer in Scala though :-)

One way is to perform the split, then pattern match on the result to select the columns you want:

peopleRDD.cache().map(_.split(",") match { case Array(a,b,_,_,e) => List(a,b,e) }) 

Another (depending on which combinations of elements you want) is to use take and drop, using a val to avoid splitting repeatedly.

peopleRDD.cache().map{ line => 
    val parts = line.split(",") 
    parts.take(2) ++ parts.drop(4)
}

(You can add a toList after the split if you want a List rather than an Array for each result element in the RDD)
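
Java arrays have no direct take/drop, but a rough analog can be sketched with subList. This is only an illustration: the class TakeDrop and the method takeAndDrop are hypothetical names, not part of Spark or the JDK, and the code assumes every line has at least `drop` comma-separated fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TakeDrop {
    // Hypothetical helper: keeps the first `take` fields and every field
    // from index `drop` onward, splitting the line only once.
    public static List<String> takeAndDrop(String line, int take, int drop) {
        List<String> parts = Arrays.asList(line.split(","));
        List<String> result = new ArrayList<>(parts.subList(0, take));
        result.addAll(parts.subList(drop, parts.size()));
        return result;
    }

    public static void main(String[] args) {
        // prints [john, smith, kansas]
        System.out.println(takeAndDrop("john,smith,84,male,kansas", 2, 4));
    }
}
```

Inside Spark this would presumably be called as peopleRDD.map(line -> TakeDrop.takeAndDrop(line, 2, 4)).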

In fact, the same approach can be used to simplify your original solution, e.g.:

peopleRDD.cache().map{ line => 
  val parts = line.split(",")
  List(parts(0), parts(1), parts(4))
}

In Java 8, you can probably do the equivalent, which is a slight improvement since it avoids calling split repeatedly. Something like:

peopleRDD.cache().map(line -> {
  // split once and reuse the result
  String[] parts = line.split(",");
  return Arrays.asList(parts[0], parts[1], parts[4]);
})
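
If the set of columns changes often, the indexing above could be pulled into a small helper that takes the wanted indices as arguments. A minimal sketch in plain Java, so it runs without Spark: ColumnSelector and select are hypothetical names, and a naive comma split is assumed (no quoted fields).

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnSelector {
    // Hypothetical helper: returns the fields at the given zero-based indices.
    public static List<String> select(String line, int... indices) {
        // A limit of -1 keeps trailing empty fields, so indices stay stable.
        String[] parts = line.split(",", -1);
        List<String> selected = new ArrayList<>(indices.length);
        for (int i : indices) {
            selected.add(parts[i]);
        }
        return selected;
    }

    public static void main(String[] args) {
        // prints [john, smith, kansas]
        System.out.println(select("john,smith,84,male,kansas", 0, 1, 4));
    }
}
```

With Spark, the call would presumably look like peopleRDD.map(line -> ColumnSelector.select(line, 0, 1, 4)). A line with fewer fields than the largest index throws ArrayIndexOutOfBoundsException, so malformed rows may need filtering first.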