Saulo Ricci - 2 months ago
Java Question

Querying a relational database from a Google Dataflow transform

I would like to implement a ParDo transform in my Dataflow pipeline that queries a relational database based on the data in each element being processed. I know that every attribute of a user-defined transform must be serializable, but to query a database using JDBC I need to create a Connection, which is naturally a non-serializable object.

Is it still possible to do that in the Dataflow pipeline context?

Answer

Yes, it is possible. You can declare the Connection field transient so that it is not serialized, and create the connection once per bundle in the startBundle method. Once all the elements in the bundle have been processed, close the connection in the finishBundle method.

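import java.sql.Connection;
import java.sql.DriverManager;

class MyDoFn extends DoFn<X, Y> {
  // Serializable configuration; jdbcUrl is a hypothetical name used for illustration
  private final String jdbcUrl;
  // transient: skipped during serialization, so it is null until startBundle runs
  private transient Connection jdbc;

  MyDoFn(String jdbcUrl) { this.jdbcUrl = jdbcUrl; }

  // Called once per bundle, before any element is processed
  @Override
  public void startBundle(Context c) throws Exception {
    jdbc = DriverManager.getConnection(jdbcUrl);
  }

  // Called once per element in the bundle
  @Override
  public void processElement(ProcessContext c) throws Exception {
    // query the database through jdbc, using c.element() as input
  }

  // Called once per bundle, after the last element
  @Override
  public void finishBundle(Context c) throws Exception {
    jdbc.close();
  }
}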
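
To see why the transient field matters, here is a minimal sketch that runs without Dataflow or a database. It assumes the runner ships the function to workers using ordinary Java serialization, and every name in it (TransientLifecycleDemo, FakeConnection, LookupFn, roundTrip, the placeholder URL) is made up for illustration. FakeConnection stands in for java.sql.Connection and is deliberately not Serializable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientLifecycleDemo {

  /** Stand-in for java.sql.Connection: deliberately NOT Serializable. */
  static class FakeConnection {
    private boolean open = true;

    String query(String key) {
      if (!open) {
        throw new IllegalStateException("connection closed");
      }
      return "row-for-" + key;
    }

    void close() { open = false; }
  }

  /** Stand-in for a DoFn: serializable configuration plus a transient resource. */
  static class LookupFn implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String url;               // serialized with the function
    private transient FakeConnection conn;  // skipped by Java serialization

    LookupFn(String url) { this.url = url; }

    void startBundle() { conn = new FakeConnection(); }
    String processElement(String element) { return conn.query(element); }
    void finishBundle() { conn.close(); conn = null; }

    boolean hasConnection() { return conn != null; }
    String url() { return url; }
  }

  /** Serializes and deserializes the function, mimicking a hand-off to a worker. */
  static LookupFn roundTrip(LookupFn fn) throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(fn);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      return (LookupFn) in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    LookupFn original = new LookupFn("jdbc:example://host/db"); // placeholder URL
    original.startBundle(); // a live connection does not break serialization

    LookupFn onWorker = roundTrip(original);
    System.out.println(onWorker.hasConnection());      // prints "false"
    onWorker.startBundle();                            // connection re-created on the "worker"
    System.out.println(onWorker.processElement("42")); // prints "row-for-42"
    onWorker.finishBundle();
  }
}
```

If the conn field were not marked transient, writeObject would throw java.io.NotSerializableException as soon as the field held a FakeConnection. With transient, the deserialized copy arrives with a null connection, which is why the connection has to be opened in startBundle rather than in the constructor.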