I have just under 100M records of data that I wish to transform by denormalising a field and then input into a date partitioned GBQ table. The dates go back to 2001.
I had hoped that I could transform it with Python and then use GBQ directly from the script to accomplish this, but after reading up on this and particularly this document it doesn't seem straight-forward to create date-partitioned tables. I'm looking for a steer in the right direction.
Is there any working example of a python script that can do this? Or is it not possible to do via Python? Or is there another method someone can point me in the direction of?
Update
I'm not sure if I've missed something, but the tables created appear to be partitioned as per the insert date of when I'm creating the table and I want to partition by a date set within the existing dataset. I can't see anyway of changing this.
Here's what I've experimenting with:
import uuid
import os
import csv
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField
from google.cloud.bigquery import Client
from google.cloud.bigquery import Table
import logging #logging.warning(data_store+file)
import json
import pprint
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path to service account credentials'
client = bigquery.Client()
dataset = client.dataset('test_dataset')
dataset.create()
SCHEMA = [
SchemaField('full_name', 'STRING', mode='required'),
SchemaField('age', 'INTEGER', mode='required'),
]
table = dataset.table('table_name', SCHEMA)
table.partitioning_type = "DAY"
table.create()
rows = [
('bob', 30),
('bill', 31)
]
table.insert_data(rows)
SELECT * FROM `dataset.test_dataset.table_name_*`
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170702'
You can easily create date partitioned tables using the API and Python SDK. Simply set the timePartitioning
field to DAY
in your script:
Or roll your own table insert request with the following body:
{
"tableReference": {
"projectId": "myProject",
"tableId": "table1",
"datasetId": "mydataset"
},
"timePartitioning": {
"type": "DAY"
}
}
Everything is just backed by the REST api here.