Vern - 1 month ago
JSON Question

Insert Bulk of data to Postgresql

I am suffering from performance issues when inserting a million rows into a PostgreSQL database.

I am sending a JSON object which has an array with a million rows.

For each row I create a record in the database table. I also tried inserting multiple rows at once, but the problem remains.

I am not sure how to handle this; I have read that the COPY command is the fastest.
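From what I have read, a COPY-based load would look roughly like this (just a sketch; the file path is a placeholder, and my data arrives as a JSON object rather than a flat file, so I would first have to flatten it):

-- Hypothetical COPY bulk load from a pre-flattened CSV file (the path is a placeholder)
COPY log_rlda (record_node_id, log_line, log_value, timestamp, record_log_id)
FROM '/tmp/log_rlda.csv' WITH (FORMAT csv);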

How can I improve the performance?

My JSON object, with "log" as an array (the log array has a million rows):

{"type":"monitoring","log":[
["2016-10-12T20:33:21","0.00","0.00","0.00","0.00","0.0","24.00","1.83","-0.00","1","1","-100.00"],
["2016-10-12T20:33:23","0.00","0.00","0.00","0.00","0.0","24.00","1.52","-0.61","1","1","-100.00"]]}


My current code (I am building a dynamic statement so that I can insert multiple rows at once):

IF (NOT b_first_line) THEN
    s_insert_query_values = right(s_insert_query_values, -1); -- remove the leading comma

    -- flush the buffered rows in a single multi-row INSERT
    EXECUTE format('INSERT INTO log_rlda
        (record_node_id, log_line, log_value, timestamp, record_log_id)
        VALUES %s;', s_insert_query_values);

    s_insert_query_values = '';
    i_num_lines_buffered = 0;
END IF;
END IF;
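The code that builds up s_insert_query_values per row is not shown here; roughly, each buffered row is appended like this (a simplified sketch only, with placeholder variable names i_node_id, i_line_no, s_value and t_timestamp):

-- Sketch of the accumulation step; every row is appended with a leading
-- comma, which right(s_insert_query_values, -1) strips again before the flush.
s_insert_query_values = s_insert_query_values ||
    format(',(%s, %s, %s, %L, %s)',
           i_node_id, i_line_no, s_value, t_timestamp, i_log_id);
i_num_lines_buffered = i_num_lines_buffered + 1;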


s_insert_query_values contains the VALUES list: each value inside the array within "log" needs to be inserted into its own row (in column log_value). This is what the resulting INSERT looks like:

INSERT INTO log_rlda
(record_node_id, log_line, log_value, timestamp, record_log_id)
VALUES
(806, 1, 0.00, '2016-10-12 20:33:21', 386),
(807, 1, 0.00, '2016-10-12 20:33:21', 386),
(808, 1, 0.00, '2016-10-12 20:33:21', 386),
(809, 1, 0.00, '2016-10-12 20:33:21', 386),
(810, 1, 0.0, '2016-10-12 20:33:21', 386),
(811, 1, 24.00, '2016-10-12 20:33:21', 386),
(768, 1, 1.83, '2016-10-12 20:33:21', 386),
(769, 1, 0.00, '2016-10-12 20:33:21', 386),
(728, 1, 1, '2016-10-12 20:33:21', 386),
(771, 1, 1, '2016-10-12 20:33:21', 386),
(729, 1, -100.00, '2016-10-12 20:33:21', 386),
(806, 2, 0.00, '2016-10-12 20:33:23', 386),
(807, 2, 0.00, '2016-10-12 20:33:23', 386),
(808, 2, 0.00, '2016-10-12 20:33:23', 386),
(809, 2, 0.00, '2016-10-12 20:33:23', 386),
(810, 2, 0.0, '2016-10-12 20:33:23', 386),
(811, 2, 24.00, '2016-10-12 20:33:23', 386),
(768, 2, 1.52, '2016-10-12 20:33:23', 386),
(769, 2, -0.61, '2016-10-12 20:33:23', 386),
(728, 2, 1, '2016-10-12 20:33:23', 386),
(771, 2, 1, '2016-10-12 20:33:23', 386),
(729, 2, -100.00, '2016-10-12 20:33:23', 386)


Solution (i_node_id_list contains IDs I selected before this query):

SELECT i_node_id_list[log_value_index] AS record_node_id,
       e.log_line - 1 AS log_line,
       items.log_value::double precision AS log_value,
       to_timestamp((e.line->>0)::text, 'YYYY-MM-DD HH24:MI:SS') AS "timestamp",
       i_log_id AS record_log_id
FROM (VALUES (log_data::json)) AS data (doc),
     json_array_elements(doc->'log') WITH ORDINALITY AS e(line, log_line),
     json_array_elements_text(e.line) WITH ORDINALITY AS items(log_value, log_value_index)
WHERE log_value_index > 1 -- don't include the timestamp value (it shouldn't be written as log_value)
  AND log_line > 1
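Wrapped in the actual INSERT, the full statement looks roughly like this (a sketch, assuming log_data, i_node_id_list and i_log_id are PL/pgSQL variables, as in the query above):

INSERT INTO log_rlda
    (record_node_id, log_line, log_value, timestamp, record_log_id)
SELECT i_node_id_list[log_value_index],
       e.log_line - 1,
       items.log_value::double precision,
       to_timestamp((e.line->>0)::text, 'YYYY-MM-DD HH24:MI:SS'),
       i_log_id
FROM (VALUES (log_data::json)) AS data (doc),
     json_array_elements(doc->'log') WITH ORDINALITY AS e(line, log_line),
     json_array_elements_text(e.line) WITH ORDINALITY AS items(log_value, log_value_index)
WHERE log_value_index > 1  -- skip the timestamp element
  AND log_line > 1;        -- same filters as the SELECT above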

Answer

You need two levels of unnesting.

select e.log_line, items.log_value, e.line -> 0 as timestamp
from (
  values ('{"type":"monitoring","log":[
  ["2016-10-12T20:33:21","0.00","0.00","0.00","0.00","0.0","24.00","1.83","-0.00","1","1","-100.00"],
  ["2016-10-12T20:33:23","0.00","0.00","0.00","0.00","0.0","24.00","1.52","-0.61","1","1","-100.00"]]}'::json)
) as data (doc), 
  json_array_elements(doc->'log') with ordinality as e(line, log_line), 
  json_array_elements(e.line)   with ordinality as items(log_value, log_value_index)
where log_value_index > 1;

The first call to json_array_elements() extracts all array elements from the log attribute. The with ordinality clause lets us identify each row in that array. The second call then extracts each element from those lines; again, with ordinality tells us the position within the array. The where log_value_index > 1 condition skips the first element of each line (the ordinality count is 1-based), which is the timestamp; that value is returned separately via e.line -> 0 instead.

The above query returns this:

log_line | log_value | timestamp            
---------+-----------+----------------------
       1 | "0.00"    | "2016-10-12T20:33:21"
       1 | "0.00"    | "2016-10-12T20:33:21"
       1 | "0.00"    | "2016-10-12T20:33:21"
       1 | "0.00"    | "2016-10-12T20:33:21"
       1 | "0.0"     | "2016-10-12T20:33:21"
       1 | "24.00"   | "2016-10-12T20:33:21"
       1 | "1.83"    | "2016-10-12T20:33:21"
       1 | "-0.00"   | "2016-10-12T20:33:21"
       1 | "1"       | "2016-10-12T20:33:21"
       1 | "1"       | "2016-10-12T20:33:21"
       1 | "-100.00" | "2016-10-12T20:33:21"
       2 | "0.00"    | "2016-10-12T20:33:23"
       2 | "0.00"    | "2016-10-12T20:33:23"
       2 | "0.00"    | "2016-10-12T20:33:23"
       2 | "0.00"    | "2016-10-12T20:33:23"
       2 | "0.0"     | "2016-10-12T20:33:23"
       2 | "24.00"   | "2016-10-12T20:33:23"
       2 | "1.52"    | "2016-10-12T20:33:23"
       2 | "-0.61"   | "2016-10-12T20:33:23"
       2 | "1"       | "2016-10-12T20:33:23"
       2 | "1"       | "2016-10-12T20:33:23"
       2 | "-100.00" | "2016-10-12T20:33:23"

The result of the above statement can then be used to insert the data directly, without looping over it. That should be a lot faster than doing lots of individual inserts.

I'm unsure, however, how you would integrate the correct record_node_id or record_log_id into the above result.
