I am working on a project where I have to report the hourly unique visitors per source. That is I have to calculate unique visitors for each source for each hour. Visitors are identified by a unique id. What should be the design so that calculation of hourly unique visitors is efficient considering the data is of the order of 20k entries per 8 hours.
At present I am using sourceid+
visitorid as the row key.
Let's start by saying that 2500k entries per hour is a pretty low volume of data (not even 1/second). Unless you want to scale massively your project would be easily achievable with a single SQL server.
Anyway, you have 2 options:
Log every visitorid+source and run a job (like mapreduce) to analyze the data every hour, or every day depending on your needs. In this case you can even completely avoid hbase and just stick to hadoop. You can log the data to a different file each hour, process it afterwards and store the results in SQL (or in HBase if you wish). Performance wise this would be the best approach.
Track the data realtime by making use of HBase counters, in this case I'd consider using 2 tables:
Table unique_users: to track the last time a visitorid has visited the site (rowkey would be visitorid+source or just visitorid, depending on if a visitor id can have different sources or just one). This table can have a TTL of 3600 seconds if you want to automatically discard old data as soon as you can but I would let a few days of data.
Table date_source_stats: to track the unique visitorid per source per hour. This table can have a TTL of a few weeks or even years depending on your retention requirements.
When a visitor enters your site you read the unique_users table to check the last access date, if that date is older than 1 hour consider it a new visit and increment the counter for the date+hour+sourceid combination in the date_source_stats table. Afterwards, update the unique_users to set the last visit time to the current time.
That way, you can easily retrieve all the unique visits for a particular date+hour with a scan and get all the sources. You may also consider a source_date_stats table in case you want to perform queries for an specific source, i.e, an hourly report for last 7 days for X source... (you can even store all the stats in the same table by using different rowkeys).
Please notice a few things about this approach:
- I've not being too detailed about the schemas, let me know if you need me to.
- I would also store total visits in another counter (which would be incremented always regardless of if it's unique or not), it's an useful value.
- This proposal can be easily extended as much as you want to also track daily, weekly, and even monthly unique visitors, you'll just need more counters and rowkeys: date+sourceid, month+sourceid... In this case you can have multiple column families with distinct TTL properties to adjust the retention policy of each set.
- This proposal could face hotspotting issues due rowkeys being sequential if you have thousands of reqs per second, you can read more about it here.
- An alternative approach for date_source_stats could be to opt for a wide design in which you have just a sourceid as rowkey and the date_hour as columns.