hum_hum_pa - 17 days ago
PHP Question

Cache Dynamic Query PHP + BigQuery

I have a database in Google BigQuery with millions of rows (more than 2 million new rows every day) containing my user activities.

I created a PHP program to get insights from this database by running many queries, to show things like statistics of the data per day, per hour, and more.

I have two cases, each with a problem:


  1. I need to find the user activity data between 2016-11-01 and 2016-11-10, and then break down the data for 2016-11-05 only (that data is basically a subset of the first query's result). The data needs to be classified per day, per hour, per user type, and so on. Right now I use many queries on the database to group the data and do all the data manipulation. For example I run "SELECT * FROM user_activities WHERE date>='2016-11-01' AND date<='2016-11-10' GROUP BY date,hour", and then when I need to break down 2016-11-05 only I re-run the query "SELECT * FROM user_activities WHERE date='2016-11-05' GROUP BY date,hour".

  2. Or sometimes I need to query the data with a different parameter, for example the user activities between 2016-11-01 and 2016-11-10 that contain activity "A", and then I need to switch to activity "B". I have a column that identifies the type of activity a user did. Right now I run a query like "SELECT * FROM user_activities WHERE activities LIKE 'A'", and when the activity type changes I run a new query "SELECT * FROM user_activities WHERE activities LIKE 'B'".



So my question is:
Because the data in my database is so big, and because the insight queries in my PHP program run at such a high frequency, the cost of data management and processing has become very high. For cases like 1 and 2, is there any alternative solution, such as caching in PHP, that would reduce the number of database requests?

In just 1-2 days my BigQuery requests can scan terabytes of data. I'm afraid this is not efficient in terms of my database management cost.

So far I have tried these solutions:


  1. I take the raw data from my database, cache it in PHP, and run the
    data manipulation manually. For example I run "SELECT * FROM
    user_activities WHERE date>='2016-11-01' AND date<='2016-11-10'" and
    then try to group by hour, by user type, or by user activities
    manually and sequentially in a PHP function. But because my data
    contains millions of rows, the process becomes very long and
    inefficient.

  2. I take the raw data from my database, insert it into a temporary
    table, and then manipulate the data by querying the temporary table.
    But this process is not efficient either, because inserting millions
    of rows takes very long.



Do you have any suggestions on how I can optimize this?

Answer
  1. Implement Partitioned Tables, as has already been recommended to you.
    If you have one single big table with 5TB of data and no partitioning, your costs are high.
    With Partitioned Tables you query only the storage for the days you reference, not the whole table; just a fraction of it, like 10GB or smaller, and you pay for that only. (A sketch of such queries follows this list.)

  2. You can save a query result directly into a destination table instead of re-importing it as you describe, and then run further aggregations against only that smaller table. (A sketch follows this list.)

  3. Try not to use 'SELECT *'; select only the columns you need in your output, since BigQuery bills by the columns a query references. (A sketch follows this list.)
  4. If the data is small enough and you run lots of small queries on it, you may want to take it out of BQ, store it in ElasticSearch or MySQL, and run the queries from there.
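
For point 1, here is a minimal sketch of what the two queries from case 1 could look like against an ingestion-time partitioned table. The table and column names come from the question; _PARTITIONTIME is BigQuery's pseudo-column for partitioned tables, and the COUNT(*) aggregation is an assumption about what the per-hour statistic looks like:

    -- Scans only the 2016-11-01 .. 2016-11-10 partitions, so only those days are billed.
    SELECT date, hour, COUNT(*) AS activity_count
    FROM user_activities
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-11-01') AND TIMESTAMP('2016-11-10')
    GROUP BY date, hour;

    -- Breaking down a single day touches only that one partition.
    SELECT date, hour, COUNT(*) AS activity_count
    FROM user_activities
    WHERE _PARTITIONTIME = TIMESTAMP('2016-11-05')
    GROUP BY date, hour;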
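
For point 2, one way to materialize an intermediate result is sketched below. It assumes standard SQL DDL (CREATE TABLE ... AS SELECT) is available to you; alternatively you can get the same effect by setting a destination table in the query job configuration. The dataset name and the column list are illustrative assumptions:

    -- Materialize the 10-day window once into a much smaller intermediate table.
    CREATE TABLE my_dataset.user_activities_2016_11_01_10 AS
    SELECT date, hour, user_type, activities
    FROM user_activities
    WHERE date >= '2016-11-01' AND date <= '2016-11-10';

    -- Further breakdowns then query only the intermediate table.
    SELECT hour, COUNT(*) AS activity_count
    FROM my_dataset.user_activities_2016_11_01_10
    WHERE date = '2016-11-05'
    GROUP BY hour;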
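
For point 3, listing only the required columns reduces the bytes scanned, because BigQuery bills by the columns a query references rather than by the rows it returns. A possible narrowing of the case 2 query (user_id is an assumed column name; the rest come from the question):

    -- Only the referenced columns are scanned and billed, unlike SELECT *.
    SELECT user_id, date, hour, activities
    FROM user_activities
    WHERE activities LIKE 'A'
      AND date >= '2016-11-01' AND date <= '2016-11-10';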