goose goose - 3 years ago 209
Python Question

How to create date partitioned tables in GBQ? Can you use python?

I have just under 100M records of data that I wish to transform by denormalising a field and then input into a date partitioned GBQ table. The dates go back to 2001.

I had hoped that I could transform it with Python and then use GBQ directly from the script to accomplish this, but after reading up on this and particularly this document it doesn't seem straight-forward to create date-partitioned tables. I'm looking for a steer in the right direction.

Is there any working example of a python script that can do this? Or is it not possible to do via Python? Or is there another method someone can point me in the direction of?

Update

I'm not sure if I've missed something, but the tables created appear to be partitioned as per the insert date of when I'm creating the table and I want to partition by a date set within the existing dataset. I can't see anyway of changing this.

Here's what I've experimenting with:

import uuid
import os
import csv

from google.cloud import bigquery
from google.cloud.bigquery import SchemaField
from google.cloud.bigquery import Client
from google.cloud.bigquery import Table
import logging #logging.warning(data_store+file)
import json
import pprint

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path to service account credentials'

client = bigquery.Client()

dataset = client.dataset('test_dataset')
dataset.create()

SCHEMA = [
SchemaField('full_name', 'STRING', mode='required'),
SchemaField('age', 'INTEGER', mode='required'),
]
table = dataset.table('table_name', SCHEMA)
table.partitioning_type = "DAY"
table.create()

rows = [
('bob', 30),
('bill', 31)
]

table.insert_data(rows)


Is it possible to modify this to take control of the partitions as I create tables and insert data?

Update 2

It turns out I wasn't looking for table partitioning, for my use case it's enough to simply append a date serial to the end of my table name and then query with something along the lines of:

SELECT * FROM `dataset.test_dataset.table_name_*`
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170702'


I don't know whether this is technically still partitioning or not, but as far as I can see it has the same benefits.

Answer Source

You can easily create date partitioned tables using the API and Python SDK. Simply set the timePartitioning field to DAY in your script:

https://github.com/GoogleCloudPlatform/google-cloud-python/blob/a14905b6931ba3be94adac4d12d59232077b33d2/bigquery/google/cloud/bigquery/table.py#L219

Or roll your own table insert request with the following body:

{
  "tableReference": {
    "projectId": "myProject",
    "tableId": "table1",
    "datasetId": "mydataset"
  },
  "timePartitioning": {
    "type": "DAY"
  }
}

Everything is just backed by the REST api here.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download