DAR DAR - 1 year ago 97
Scala Question

error writing PubSub stream to Cloud Storage using Dataflow

Using SCIO from

to write a job for
, following 2 examples e.g1 and e.g2 to write a
stream to
, but get the following error for the below code


Exception in thread "main" java.lang.IllegalArgumentException: Write can only be applied to a Bounded PCollection


object StreamingPubSub {
def main(cmdlineArgs: Array[String]): Unit = {
// set up example wiring
val (opts, args) = ScioContext.parseArguments[ExampleOptions](cmdlineArgs)
val dataflowUtils = new DataflowExampleUtils(opts)

val sc = ScioContext(opts)

.timestampBy {
_ => new Instant(System.currentTimeMillis() - (scala.math.random * RAND_RANGE).toLong)
.groupBy(_ => Unit)

val result = sc.close()

// CTRL-C to cancel the streaming pipeline

I maybe mixing up the window concept with Bounded PCollection, is there a way to achieve this or do I need to apply some transform to make this happen, anyone can be of assistance on this

Answer Source

I believe SCIO's saveAsTextFile underneath uses Dataflow's Write transformation, which supports bounded PCollections only. Dataflow doesn't provide a direct API to write an unbounded PCollection to Google Cloud Storage yet, although this is something we are investigating.

To persist an unbounded PCollection somewhere, consider, for example, BigQuery, Datastore, or Bigtable. In SCIO's API, you could use, for example, saveAsBigQuery.