mongolol - 28 days ago
Java Question

Iterate through Files in Google Cloud Bucket

I am attempting to implement a relatively simple ETL pipeline that iterates through files in a Google Cloud Storage bucket. The bucket has two folders: /input and /output.

What I'm trying to do is write a Java/Scala script that iterates through the files in /input and applies the transformation to those that are not present in /output, or whose timestamp is later than that of their counterpart in /output. I've been looking through the Java API doc for a function I can leverage (as opposed to just calling gsutil ls ...), but haven't had any luck so far. Any recommendations on where to look in the doc?

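import java.net.URLEncoder
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport
import com.google.api.client.http.GenericUrl

def getBucketFolderContents(
  bucketName: String
) = {
  // getCredential is a helper defined elsewhere (not shown in the post)
  val credential = getCredential
  val httpTransport = GoogleNetHttpTransport.newTrustedTransport()
  val requestFactory = httpTransport.createRequestFactory(credential)
  // The endpoint URL around the bucket name was lost from the post; only the encoding call survives
  val uri = "" + URLEncoder.encode(bucketName, "UTF-8")
  val url = new GenericUrl(uri)
  // buildGetRequest takes the GenericUrl, not the raw string
  val request = requestFactory.buildGetRequest(url)
  val response = request.execute()
  // (the rest of the function is cut off in the post)
}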


Answer Source

You can list the objects under a folder by setting the prefix string on the object listing API. The listing results are sorted by object name, so you should be able to list both folders, walk through the two listings in order, and generate the diff list as you go.
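
For the "walk through both in order" part, here is a minimal sketch in plain Java. It is not tied to any particular client library: the Entry class and the needsProcessing method are hypothetical names made up for illustration, and it assumes you have already turned each listing into a name-sorted list with the folder prefix stripped (so input/a.csv and output/a.csv both appear as a.csv). If you use the google-cloud-storage client, the prefix can, as far as I know, be passed as Storage.BlobListOption.prefix("input/") to Storage.list, but check that against the docs for your version.

```java
import java.util.ArrayList;
import java.util.List;

final class BucketDiff {
    /** Hypothetical holder for one listed object: name relative to its folder, plus last-update time. */
    static final class Entry {
        final String name;
        final long updatedMillis;

        Entry(String name, long updatedMillis) {
            this.name = name;
            this.updatedMillis = updatedMillis;
        }
    }

    /**
     * Walks two listings in step and returns the input names that are either
     * missing from the output listing or newer than their output copy.
     * Assumes both lists are sorted by name in String.compareTo order.
     */
    static List<String> needsProcessing(List<Entry> inputs, List<Entry> outputs) {
        List<String> result = new ArrayList<>();
        int j = 0;
        for (Entry in : inputs) {
            // Skip outputs that sort before the current input (they have no matching input).
            while (j < outputs.size() && outputs.get(j).name.compareTo(in.name) < 0) {
                j++;
            }
            boolean hasOutput = j < outputs.size() && outputs.get(j).name.equals(in.name);
            // Process when there is no output yet, or the input is newer than the output.
            if (!hasOutput || in.updatedMillis > outputs.get(j).updatedMillis) {
                result.add(in.name);
            }
        }
        return result;
    }
}
```

Because both lists are already sorted, each one is traversed only once, so there is no need to build a lookup map of the output folder first. One caveat: the service orders names by their encoded bytes, which can differ from String.compareTo for unusual characters, so sort the lists yourself if your object names are not plain ASCII.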