I have a Cassandra database whose data I analyzed with SparkSQL through Apache Spark. Now I want to insert that analyzed data into PostgreSQL. Is there any way to achieve this directly, apart from using the PostgreSQL driver? (I already achieved it using PostgREST and the driver.) I want to know whether there is any method like
saveToCassandra()
At the moment there is no native implementation of writing an RDD to any DBMS. Here are links to the related discussions on the Spark user list: one, two
In general, the most performant approach would be the following:
1. Check the number of partitions in your RDD: it should be neither too low nor too high. 20-50 partitions is a good range; if the number is lower, call `repartition` with 20 partitions, and if it is higher, call `coalesce` down to 50 partitions.
2. Call the `mapPartitions` transformation and, inside it, call a function that inserts the records into your DBMS over JDBC. In that function, open a connection to your database and use the COPY command with this API; it eliminates the need for a separate command per record, so the inserts are processed much faster (see the sketch after this list).

This way you would insert the data into Postgres in parallel, using up to 50 concurrent connections (depending on your Spark cluster size and its configuration). The whole approach can be implemented as a Java/Scala function that accepts the RDD and the connection string.
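A minimal Scala sketch of how this might look, assuming a hypothetical `Result` record and a hypothetical target table `results(id, score)`. It uses `foreachPartition` (the action counterpart of `mapPartitions`, which is sufficient here since we only need the side effect) and the PostgreSQL JDBC driver's `CopyManager` to feed each partition through a single COPY command:

```scala
import java.io.StringReader
import java.sql.DriverManager

import org.apache.spark.rdd.RDD
import org.postgresql.PGConnection

// Hypothetical record type standing in for the analyzed rows.
case class Result(id: Long, score: Double)

object PostgresWriter {

  // Writes the RDD into a hypothetical results(id, score) table,
  // one COPY stream per partition.
  def saveToPostgres(rdd: RDD[Result], connectionUrl: String): Unit = {
    // Keep the partition count (and hence the number of parallel
    // connections) in the 20-50 range suggested above.
    val sized =
      if (rdd.partitions.length < 20) rdd.repartition(20)
      else if (rdd.partitions.length > 50) rdd.coalesce(50)
      else rdd

    // foreachPartition runs on the executors; each partition gets
    // its own connection and a single COPY command.
    sized.foreachPartition { rows =>
      val conn = DriverManager.getConnection(connectionUrl)
      try {
        val copyManager = conn.unwrap(classOf[PGConnection]).getCopyAPI
        // COPY's default text format: tab-separated columns,
        // newline-terminated rows. For very large partitions you
        // would stream the rows instead of building one string.
        val payload = rows
          .map(r => s"${r.id}\t${r.score}")
          .mkString("", "\n", "\n")
        copyManager.copyIn(
          "COPY results (id, score) FROM STDIN",
          new StringReader(payload))
      } finally {
        conn.close()
      }
    }
  }
}
```

Each partition opens exactly one connection and issues exactly one COPY, so the number of concurrent Postgres connections is bounded by the partition count chosen at the top of the function.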