Forepick - 7 months ago
Java Question

Cloud Dataflow: reading entire text files rather than line by line

I'm looking for a way to read entire files, so that each file is read as a whole into a single String.
I want to pass a pattern of JSON text files such as gs://my_bucket/*/*.json and then have a ParDo process each file in its entirety.

What's the best approach?


I am going to give the most generally useful answer, even though there are special cases [1] where you might do something different.

I think what you want to do is to define a new subclass of FileBasedSource and use Read.from(<source>). Your source will also include a subclass of FileBasedReader; the source contains the configuration data and the reader actually does the reading.

I think a full description of the API is best left to the Javadoc, but I will highlight the key override points and how they relate to your needs:

  • FileBasedSource#isSplittable(): override this to return false. This indicates that there is no intra-file splitting, so a file is never divided among readers.
  • FileBasedSource#createForSubrangeOfFile(String, long, long): override this to return a sub-source for just the file specified.
  • FileBasedSource#createSingleFileReader(): override this to produce a FileBasedReader for the current file (the method should assume the source is already split to the level of a single file).

To implement the reader:

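  • FileBasedReader#startReading(...): override this to do essentially nothing (at most, keep a reference to the channel it is handed); the framework will already have opened the file for you, and it will close it.
  • FileBasedReader#readNextRecord(): override this to read the entire file as a single element. It should report a record on the first call and report no more records on later calls, so that each file yields exactly one element.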
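
To make the two reader override points concrete, here is a minimal sketch that uses only the JDK. WholeFileReaderSketch is a hypothetical stand-in that mirrors the shape of the reader described above; it does not extend the SDK's FileBasedReader, and the method signatures, the 8 KB buffer, and the UTF-8 decoding are assumptions made for illustration. In this stand-in, startReading only stores the already-open channel because readNextRecord takes no arguments.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.NoSuchElementException;

// Hypothetical stand-in: same shape as the reader, but not the SDK class.
class WholeFileReaderSketch {
    private ReadableByteChannel channel; // opened and closed by the caller
    private String current;              // the single record, once read
    private boolean done;                // true after the file was consumed

    // Counterpart of startReading(...): only remember the open channel.
    void startReading(ReadableByteChannel channel) {
        this.channel = channel;
    }

    // Counterpart of readNextRecord(): the first call drains the channel
    // into one String and returns true; every later call returns false.
    boolean readNextRecord() throws IOException {
        if (done) {
            current = null;
            return false;
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ByteBuffer buffer = ByteBuffer.allocate(8192); // assumed buffer size
        while (channel.read(buffer) != -1) {
            buffer.flip();
            bytes.write(buffer.array(), 0, buffer.limit());
            buffer.clear();
        }
        // Assumes the files are UTF-8 encoded text.
        current = new String(bytes.toByteArray(), StandardCharsets.UTF_8);
        done = true;
        return true;
    }

    // Returns the whole file content read by the last successful call.
    String getCurrent() {
        if (current == null) {
            throw new NoSuchElementException("no current record");
        }
        return current;
    }
}
```

Because the whole file is buffered in memory before it is decoded, this approach only suits files that fit comfortably in a worker's heap. An empty file produces one empty String in this sketch.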

[1] One easy special case is when you have a small number of files, you can expand the glob prior to job submission, and the files all take about the same amount of time to process. Then you can just use Create.of(expand(<glob>)) followed by ParDo(<read a file>).
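
The shape of that special case is "expand the glob up front, then read each file whole in a per-element function". The sketch below shows that shape with the JDK only and runs against the local filesystem, which is an assumption made so the example is self-contained: gs:// paths need GCS-aware utilities that are not shown here. ExpandThenReadSketch, expand, and readWholeFile are hypothetical names. In a pipeline, the list returned by expand would be what is handed to Create.of, and readWholeFile would be the body of the ParDo.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-filesystem stand-in for the expand-then-read pattern.
class ExpandThenReadSketch {

    // Lists regular files under root whose path relative to root matches
    // the glob, e.g. "*/*.json" for files exactly one directory deep.
    static List<String> expand(Path root, String glob) throws IOException {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> matcher.matches(root.relativize(p)))
                .map(Path::toString)
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // What the per-element function would do: read one file into a String.
    // Assumes UTF-8 text that fits in memory.
    static String readWholeFile(String file) throws IOException {
        return new String(Files.readAllBytes(Paths.get(file)),
                          StandardCharsets.UTF_8);
    }
}
```

Expanding before submission fixes the file list at that moment, so files added later are not picked up, and the work is divided per file name rather than by file size.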