mongolol mongolol - 28 days ago 21
Java Question

Iterate through Files in Google Cloud Bucket

I am attempting to implement a relatively simple ETL pipeline that iterates through files in a google cloud bucket. The bucket has two folders: /input and /output.

What I'm trying to do is write a Java/Scala script to iterate through files in /input, and have the transformation applied to those that are not present in /output or those that have a timestamp later than that in /output. I've been looking through the Java API doc for a function I can leverage (as opposed to just calling

gsutil ls ...
), but haven't had any luck so far. Any recommendations on where to look in the doc?

def getBucketFolderContents(
bucketName: String
) = {
val credential = getCredential
val httpTransport = GoogleNetHttpTransport.newTrustedTransport()
val requestFactory = httpTransport.createRequestFactory(credential)
val uri = "https://www.googleapis.com/storage/v1/b/" + URLEncoder.encode(
bucketName,
"UTF-8") +
"o/raw%2f"
val url = new GenericUrl(uri)
val request = requestFactory.buildGetRequest(uri)
val response = request.execute()
response

}
}

Answer Source

You can list objects under a folder by setting the prefix string on the object listing API: https://cloud.google.com/storage/docs/json_api/v1/objects/list The results of listing are sorted, so you should be able to list both folders and then walk through both in order and generate the diff list.