HyperDevil - 1 month ago
PHP Question

Amazon S3 - Store time-based files

I would like to use S3 object storage to store time-based data, 1 file per minute.

Currently this is stored on EBS with a folder for each year, month, and date, and a file under the date folder for every minute of the day.

I see no issue, filesystem-wise, with storing the files on object storage. The question is: if I want to "query" S3 to retrieve specific time intervals, is that possible?

If not, what would be the best way to implement a "search" function on top?

Keep an index in SimpleDB, do exact file-name matching, etc.?
Does anybody have experience with this?

I am going to use the PHP SDK for S3.

Answer

Amazon S3 does not have a "query" language. The best you can do is to organize files into prefixes and limit results based on that.

For example, if your objects in S3 were to be:

year-month-day-hour-minute-second.txt

Then you can list objects by:

  • a certain year: 2016-
  • a certain month: 2016-10-
  • a certain day: 2016-10-31-

and so on using prefixes.
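As a rough illustration of how keys and prefixes line up, here is a minimal Python sketch (the same string logic carries over to PHP). It assumes the key layout above with zero-padded fields; `key_for`, `prefix_for`, and `PREFIX_FORMATS` are hypothetical helper names, not part of any SDK.

```python
from datetime import datetime

# Assumed key layout from the example: year-month-day-hour-minute-second.txt
KEY_FORMAT = "%Y-%m-%d-%H-%M-%S"

# strftime patterns for each prefix granularity (hypothetical names).
PREFIX_FORMATS = {
    "year": "%Y-",
    "month": "%Y-%m-",
    "day": "%Y-%m-%d-",
}


def key_for(moment: datetime) -> str:
    """Build the object key for a given timestamp."""
    return moment.strftime(KEY_FORMAT) + ".txt"


def prefix_for(moment: datetime, granularity: str) -> str:
    """Build the listing prefix that covers the given year, month, or day."""
    return moment.strftime(PREFIX_FORMATS[granularity])


if __name__ == "__main__":
    moment = datetime(2016, 10, 31, 12, 1, 0)
    print(key_for(moment))
    for granularity in ("year", "month", "day"):
        print(granularity, prefix_for(moment, granularity))
```

Zero-padding matters here: without it, a prefix such as "2016-1" would match both January and October through December.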

But you cannot do very specific time ranges within S3. If you want to query based on a specific time range, then you would need to collect the daily/monthly/yearly results yourself, then trim away what you want to exclude.

For example, if you wanted to query objects between 12:01pm October 29 and 12:01pm October 31, then you could collect objects from the following prefixes:

  • 2016-10-29-
  • 2016-10-30-
  • 2016-10-31-

and manually remove items before and after your desired time range.
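The collect-then-trim approach could look roughly like the Python sketch below. It is only a sketch under assumptions: keys follow the year-month-day-hour-minute-second.txt layout, and `list_keys` is a hypothetical callable that you would back with your SDK's list-objects call using its prefix parameter. Here it is faked with an in-memory list so the example runs on its own.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable, List

# Assumed key layout: year-month-day-hour-minute-second.txt
KEY_FORMAT = "%Y-%m-%d-%H-%M-%S"


def day_prefixes(start: datetime, end: datetime) -> List[str]:
    """One prefix per calendar day touched by the range, inclusive."""
    prefixes = []
    day = start.date()
    while day <= end.date():
        prefixes.append(day.strftime("%Y-%m-%d-"))
        day += timedelta(days=1)
    return prefixes


def key_time(key: str) -> datetime:
    """Parse the timestamp encoded in an object key."""
    stem = key.rsplit(".", 1)[0]  # drop the ".txt" extension
    return datetime.strptime(stem, KEY_FORMAT)


def keys_in_range(list_keys: Callable[[str], Iterable[str]],
                  start: datetime, end: datetime) -> List[str]:
    """List each day's keys, then trim those outside [start, end]."""
    selected = []
    for prefix in day_prefixes(start, end):
        for key in list_keys(prefix):
            if start <= key_time(key) <= end:
                selected.append(key)
    return selected


if __name__ == "__main__":
    # Stand-in for the bucket contents (hypothetical data).
    bucket = [
        "2016-10-29-12-00-00.txt",
        "2016-10-29-12-01-00.txt",
        "2016-10-30-00-00-00.txt",
        "2016-10-31-12-01-00.txt",
        "2016-10-31-12-02-00.txt",
    ]

    def fake_list(prefix: str) -> List[str]:
        return [k for k in bucket if k.startswith(prefix)]

    start = datetime(2016, 10, 29, 12, 1)
    end = datetime(2016, 10, 31, 12, 1)
    print(keys_in_range(fake_list, start, end))
```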

If you want to do better querying, then you're best off using a database designed for querying. SimpleDB may work. DynamoDB or a SQL database will work. You could dump a file into S3, then record its object key and timestamp in the database.

On query, select from the db, then retrieve files from S3 as needed.
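To make the database route concrete, here is a minimal Python sketch using SQLite as a stand-in for whichever database you pick. The table and column names (`objects`, `object_key`, `recorded_at`) and the helper functions are assumptions for illustration; the actual upload to and download from S3 are left out.

```python
import sqlite3
from datetime import datetime
from typing import List


def _stamp(moment: datetime) -> str:
    """Fixed-width text timestamp, so string order equals time order."""
    return moment.isoformat(sep=" ", timespec="seconds")


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Create the index table that maps timestamps to S3 object keys."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        "object_key TEXT PRIMARY KEY, recorded_at TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_recorded_at ON objects (recorded_at)"
    )
    return conn


def record_object(conn: sqlite3.Connection, object_key: str,
                  recorded_at: datetime) -> None:
    """Call this right after the file has been written to S3."""
    conn.execute(
        "INSERT OR REPLACE INTO objects (object_key, recorded_at) VALUES (?, ?)",
        (object_key, _stamp(recorded_at)),
    )
    conn.commit()


def keys_between(conn: sqlite3.Connection, start: datetime,
                 end: datetime) -> List[str]:
    """Return the object keys whose timestamp falls in [start, end]."""
    rows = conn.execute(
        "SELECT object_key FROM objects "
        "WHERE recorded_at BETWEEN ? AND ? ORDER BY recorded_at",
        (_stamp(start), _stamp(end)),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_index()
    record_object(conn, "2016-10-29-12-00.txt", datetime(2016, 10, 29, 12, 0))
    record_object(conn, "2016-10-30-08-15.txt", datetime(2016, 10, 30, 8, 15))
    record_object(conn, "2016-10-31-12-02.txt", datetime(2016, 10, 31, 12, 2))
    print(keys_between(conn, datetime(2016, 10, 29, 12, 1),
                       datetime(2016, 10, 31, 12, 1)))
```

With an index like this the object keys no longer need to encode the time at all, since the range query runs in the database and S3 is only asked for exact keys.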

Update: An example using prefixes

Suppose you have a bunch of minutely files as such:

2016-10-29-00-00.txt
2016-10-29-00-01.txt
2016-10-29-00-02.txt
...
2016-10-30-00-00.txt
2016-10-30-00-01.txt
...
2016-10-31-00-00.txt
...
2016-11-01-00-00.txt

And so on.

Then you can do the following searches using prefixes:

  • To get all files from 2016: prefix = "2016-"
  • To get all files from October 2016: prefix = "2016-10-"
  • To get all files from October 30, 2016: prefix = "2016-10-30-"
  • To get all files from 00:00 to 00:59 on October 30, 2016: prefix = "2016-10-30-00"
  • To get all files from the minute at 00:05 on October 30, 2016: prefix = "2016-10-30-00-05"

S3 cannot do range searches, such as:

  • Files between 12:00 on October 29, 2016 and 11:59 on October 31, 2016

Instead, you have 2 options:

Option 1: Retrieve objects from S3 for each day in your date range using prefixes:

  • "2016-10-29-"
  • "2016-10-30-"
  • "2016-10-31-"

Once you have those lists, you would combine them and discard the files from before and after your desired time range.

Option 2: Retrieve objects from S3 for each month in your date range using prefixes:

  • "2016-10-"

Again, once you have the results, you would combine them and discard the files from before and after your desired time range.

Which you choose depends on how many distinct days you'll need to retrieve compared to the number of objects returned on a search by month.
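One way to compare the two options is to estimate how many list requests each would need. The Python sketch below assumes exactly one file per minute with no gaps, and relies on S3 list calls returning at most 1000 keys per request; the function names are hypothetical and the result is only a rough estimate.

```python
import calendar
import math
from datetime import date

PAGE_SIZE = 1000          # S3 list calls return at most 1000 keys per request
FILES_PER_DAY = 24 * 60   # assumption: one file per minute, no gaps


def pages(object_count: int) -> int:
    """Number of list requests needed to page through object_count keys."""
    return math.ceil(object_count / PAGE_SIZE)


def requests_by_day(start: date, end: date) -> int:
    """Option 1: one prefix per day in the range."""
    days = (end - start).days + 1
    return days * pages(FILES_PER_DAY)


def requests_by_month(start: date, end: date) -> int:
    """Option 2: one prefix per month touched by the range."""
    total = 0
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        days_in_month = calendar.monthrange(year, month)[1]
        total += pages(days_in_month * FILES_PER_DAY)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return total


if __name__ == "__main__":
    start, end = date(2016, 10, 29), date(2016, 10, 31)
    print("by day:", requests_by_day(start, end))
    print("by month:", requests_by_month(start, end))
```

Under these assumptions, the three-day range costs 6 requests by day versus 45 by month, while a range covering all of October costs 62 by day versus 45 by month, so the month prefix only pays off once the range spans most of a month.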

The logic for this would get quite complex. A proper searchable db may be worthwhile.