Hasson Hasson - 18 days ago 5
Java Question

Save to Cassandra in Spark: parallelize method is not available in Java

I am trying to save rows to a Cassandra table using Spark in Java (the data is the result of a long processing job in Spark). I am using the new way of connecting to Cassandra, through a Spark session, as follows:

SparkSession spark = SparkSession
        .builder()
        .appName("App")
        .config("spark.cassandra.connection.host", "cassandra1.example.com")
        .config("spark.cassandra.connection.port", "9042")
        .master("spark://cassandra.example.com:7077")
        .getOrCreate();


The connection is successful and works well, as I have Spark installed on the same nodes as Cassandra. After reading some RDDs from Cassandra, I want to save to another table in Cassandra, so I am following the documentation here, namely the part on saving to Cassandra:

List<Person> people = Arrays.asList(
        new Person(1, "John", new Date()),
        new Person(2, "Troy", new Date()),
        new Person(3, "Andrew", new Date())
);
JavaRDD<Person> rdd = spark.sparkContext().parallelize(people);
javaFunctions(rdd).writerBuilder("ks", "people", mapToRow(Person.class)).saveToCassandra();


The problem I am facing is that the parallelize call is not accepted: only the Scala version of the method appears to be available. The error is:

The method parallelize(Seq<T>, int, ClassTag<T>) in the type
SparkContext is not applicable for the arguments (List<Person>)


How can I do this in Java in order to save to a Cassandra table?

Answer

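SparkContext.parallelize is the Scala API: as the error message shows, it expects a Scala Seq<T> (plus a partition count and a ClassTag<T>), not a java.util.List. To parallelize a java.util.List, use JavaSparkContext (not SparkContext):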

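import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Wrap the session's SparkContext in the Java-friendly API.
// Note that sparkContext() is a method call in Java, so the parentheses are required.
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
JavaRDD<Person> rdd = sc.parallelize(people);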
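
Once the RDD exists, the saveToCassandra call from the question relies on mapToRow(Person.class) being able to read the object's properties. The question does not show the Person class, so the following is only a sketch of what such a bean might look like. The field names (id, name, birthDate) and their types are assumptions and would need to match the columns of the target table.

```java
import java.io.Serializable;
import java.util.Date;

// Hypothetical bean for the Person objects used in the question.
// Field names and types are assumptions, not taken from the original post.
// Shown package-private so the sketch fits in one file; in a project it would
// normally be a public class in its own Person.java.
class Person implements Serializable {
    private static final long serialVersionUID = 1L;

    private Integer id;
    private String name;
    private Date birthDate;

    // No-argument constructor, as conventional for a JavaBean.
    public Person() { }

    // Convenience constructor matching new Person(1, "John", new Date()).
    public Person(Integer id, String name, Date birthDate) {
        this.id = id;
        this.name = name;
        this.birthDate = birthDate;
    }

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public Date getBirthDate() { return birthDate; }
    public void setBirthDate(Date birthDate) { this.birthDate = birthDate; }
}
```

The class implements Serializable because Spark ships the elements of a parallelized collection to the executors.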