gsinha gsinha - 2 months ago 14
Python Question

How to paginate in GCS when using the GAE Python GCS client library for access?

GCS = Google Cloud Storage

GAE = Google App Engine

If there is a huge number of files in a given directory (an emulated directory, since no real ones exist), how do I manage the following:


  1. Listing all files for some processing in my GAE Python code?

  2. Sorting in descending order of file name (in directories where all file
    names can be converted to numbers)?



The listbucket() documentation mentions pagination but does not elaborate. I do not understand how to paginate using listbucket().

I used listbucket() as shown below:

import os

import cloudstorage as gcs
from google.appengine.api import app_identity

# ...
bucket_name = os.environ.get('BUCKET_NAME', app_identity.get_default_gcs_bucket_name())


gcs_list_obj = gcs.listbucket('/' + bucket_name + '/dir_1/dir_2/', delimiter="/")

# ITERATE THROUGH YEAR DIRECTORIES TO GET THE HIGHEST YEAR DIRECTORY NAME VALUE.
year_list = []
for item in gcs_list_obj:
    # EACH "ITEM" WOULD BE A DIRECTORY REPRESENTING TIMESTAMP YEAR.
    if item.is_dir:
        # IT IS A DIRECTORY.
        filename = item.filename
        # EXTRACT YEAR FROM ABSOLUTE FILENAME (WHICH ENDS WITH "/").
        year_name = ""
        counter = len(filename) - 2  # START AT SECOND LAST CHARACTER.
        while filename[counter] != "/":
            year_name = filename[counter] + year_name
            counter = counter - 1
        # COLLECT ALL YEAR VALUES.
        year_list.append(int(year_name))

# SORT THEM IN DESCENDING ORDER.
year_list = sorted(year_list, reverse=True)

Answer

cloudstorage.listbucket returns an iterator, so you can "paginate" by getting and showing only N items at a time (e.g. with itertools.islice from the standard Python library).
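To make the islice idea concrete, here is a minimal sketch. The helper name iter_pages and the fake_listing stand-in are my own for illustration, not part of the cloudstorage library; in real code you would pass in the iterator returned by gcs.listbucket(...) instead of the fake one.

```python
from itertools import islice


def iter_pages(iterable, page_size):
    """Yield successive lists of at most page_size items from any iterable."""
    iterator = iter(iterable)
    while True:
        # islice consumes only the next page_size items from the iterator.
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page


# Hypothetical stand-in for the iterator returned by gcs.listbucket(...).
fake_listing = iter('/bucket/dir_1/dir_2/%d/' % year for year in range(2010, 2017))

pages = list(iter_pages(fake_listing, 3))
```

Because each page is pulled lazily, only page_size items need to be held in memory at a time, as long as you process each page before asking for the next one.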

However, it yields object info (GCSFileStat instances, see https://cloud.google.com/appengine/docs/python/googlecloudstorageclient/gcsfilestat_class) strictly in alphabetical order of object name, and there's no way to change that (in particular, no way to invert the order, as you seem to want).

If you must show the objects in some different order, you'll have to forgo actual pagination: build a list in memory and then sort it, as you're doing now. You can then present that sorted list in a "paginated" way, of course, but by then it has already taken all that memory.
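A rough sketch of that "collect, sort, then slice" approach follows. The function names year_from_dirname and sorted_pages are hypothetical, and the sample paths stand in for the item.filename values you would collect from listbucket. It assumes every directory name is a plain integer, as in the question.

```python
def year_from_dirname(filename):
    """Return the last path component of a directory name like '/b/d/2016/' as an int."""
    return int(filename.rstrip('/').rsplit('/', 1)[-1])


def sorted_pages(filenames, page_size):
    """Sort names by numeric value, descending, and split them into pages."""
    # The whole listing is materialized here; that is the memory cost.
    ordered = sorted(filenames, key=year_from_dirname, reverse=True)
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


# Hypothetical sample of filenames collected from a listing.
names = ['/bucket/dir_1/dir_2/%d/' % y for y in (2014, 2016, 2009, 2015, 2011)]
pages = sorted_pages(names, 2)
```

Sorting with a numeric key rather than on the raw strings matters when the names have different lengths: as strings, "10" sorts before "9".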

Feel free to open a feature request at https://code.google.com/p/googleappengine/issues/list, of course. There is currently no feature to have GCS sort results in any order other than ascending alphabetical order by object name.