XUser XUser - 3 months ago 32
Python Question

How to insert/append unstructured data to bigquery table

Background

I want to insert/append newline formatted JSON into

bigquery
table through python client API.

Eg:

{"name":"xyz",mobile:xxx,location:"abc"}
{"name":"xyz",mobile:xxx,age:22}


Issue is, all fields in a row are optional and there is no fixed defined schema for the data.

Query

I have read that we can use Federated tables, which supports autoschema detection.

However, I am looking for a feature, that would automatically detect schema from data,create tables accordingly and even adjust the table schema if any extra columns/keys appear in data instead of creating new table.

Would this be possible using python client API.

Answer

You can use autodetect with BigQuery load API, i.e. your example using bq cli tool will look like following:

~$ cat /tmp/x.json
{"name":"xyz","mobile":"xxx","location":"abc"}
{"name":"xyz","mobile":"xxx","age":"22"}

~$ bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON tmp.x /tmp/x.json
Upload complete.

~$ bq show tmp.x
Table tmp.x

   Last modified          Schema          Total Rows   Total Bytes   Expiration  
 ----------------- --------------------- ------------ ------------- ------------ 
  16 Aug 08:23:35   |- age: integer       2            33                        
                    |- location: string                                          
                    |- mobile: string                                            
                    |- name: string                                              


~$ bq query "select * from tmp.x"

+------+----------+--------+------+
| age  | location | mobile | name |
+------+----------+--------+------+
| NULL | abc      | xxx    | xyz  |
|   22 | NULL     | xxx    | xyz  |
+------+----------+--------+------+

Update: If later you need to add additional fields, you can use schema_update_option to allow new fields. Alas it doesn't yet work with autodetect, so you need to provide new schema explicitly to the load API:

~$ cat /tmp/x1.json 
{"name":"abc","mobile":"yyy","age":"25","gender":"male"}

~$ bq load --schema=name:STRING,age:INTEGER,location:STRING,mobile:STRING,gender:STRING --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON tmp.x /tmp/x1.json
Upload complete.

~$ bq show tmp.x
Table tmp.x

   Last modified          Schema          Total Rows   Total Bytes   Expiration  
 ----------------- --------------------- ------------ ------------- -----------
  19 Aug 10:43:09   |- name: string       3            57                        
                    |- age: integer                                              
                    |- location: string                                          
                    |- mobile: string                                            
                    |- gender: string                                            


~$ bq query "select * from tmp.x"
status: DONE   
+------+------+----------+--------+--------+
| name | age  | location | mobile | gender |
+------+------+----------+--------+--------+
| abc  |   25 | NULL     | yyy    | male   |
| xyz  | NULL | abc      | xxx    | NULL   |
| xyz  |   22 | NULL     | xxx    | NULL   |
+------+------+----------+--------+--------+
Comments