Justin Justin - 2 months ago 46
Java Question

Inserting Analytic data from Spark to Postgres

I have Cassandra database from which i analyzed the data using SparkSQL through Apache Spark. Now i want to insert those analyzed data into PostgreSQL . Is there any ways to achieve this directly apart from using the PostgreSQL driver (I achieved it using postREST and Driver i want to know whether there is any methods like



At the moment there is no native implementation of writing the RDD to any DBMS. Here are the links to the related discussions in the Spark user list: one, two

In general, the most performant approach would be the following:

  1. Validate the number of partitions in RDD, it should not be too low and too high. 20-50 partitions should be fine, if the number is lower - call repartition with 20 partitions, if higher - call coalesce to 50 partitions
  2. Call the mapPartition transformation, inside of it call the function to insert the records to your DBMS using JDBC. In this function you open the connection to your database and use the COPY command with this API, it would allow you to eliminate the need for a separate command for each record - this way the insert would be processed much faster

This way you would insert the data into Postgres in a parallel fashion utilizing up to 50 parallel connection (depends on your Spark cluster size and its configuration). The whole approach might be implemented as a Java/Scala function accepting the RDD and the connection string