Robbert - 2 months ago
MySQL Question

How do I get more than 50,000 inserts per second with TokuDB?

Currently I'm testing out TokuDB and I'm very impressed. At this moment the insert rate has peaked at just over 50,000 inserts per second with two simultaneous jobs running. The average insert rate is between 38,000 and 42,000 inserts per second.

I would like to go even higher, to 100,000 inserts per second, as I will need to insert 1.2 billion calculated rows for now and about 6 billion more in the near future. I would like some advice on how to achieve this :-)

My current setup:


  1. Hardware: VPS with 4GB of RAM, 150GB SSD, 2 cores: Intel Westmere E56xx/L56xx/X56xx (Nehalem-C) 2.59GHz CPU

  2. Disk mount options: defaults,noatime

  3. OS: CentOS 6.8 64bit

  4. Database: Percona Server 5.7.14-8



My.cnf settings:

# TokuDB #
tokudb_cache_size = 2G
tokudb_commit_sync = 0
tokudb_fsync_log_period = 1000


TokuDB table layout:

CREATE TABLE `t1` (
`id` int(15) NOT NULL AUTO_INCREMENT,
`m_id` int(11) NOT NULL,
`c1` decimal(6,2) DEFAULT NULL,
`c2` decimal(6,2) DEFAULT NULL,
`c3` decimal(6,2) DEFAULT NULL,
`c4` decimal(6,2) DEFAULT NULL,
`c5` decimal(6,2) DEFAULT NULL,
`c6` decimal(6,2) DEFAULT NULL,
`c7` decimal(6,2) DEFAULT NULL,
`factor` decimal(4,2) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1

CREATE TABLE `t2` (
`id` int(15) NOT NULL AUTO_INCREMENT,
`v_id` int(15) NOT NULL,
`pid` int(11) DEFAULT NULL,
`amount` decimal(6,2) DEFAULT NULL,
`unit` int(1) DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1



I'm aware that I'm not using any indexes other than the primary key. This is deliberate, because of the negative impact secondary keys have on insert speed; a clustering key for each table will be created at the end of the insert job.
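For context, TokuDB supports clustering secondary indexes via a `CLUSTERING` keyword. A sketch of what the post-load step might look like (the column choice here is a placeholder; the real key depends on the query workload):

```sql
-- hypothetical example: pick key columns to match the eventual queries
ALTER TABLE t1 ADD CLUSTERING KEY idx_m_id (m_id);
ALTER TABLE t2 ADD CLUSTERING KEY idx_v_id (v_id);
```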


Additional MySQL session setting:

SET unique_checks=OFF;


Somehow I'm not able to get this into my.cnf. If someone knows how, that would be greatly appreciated (currently unique_checks = off blocks MySQL from starting due to an unknown variable in my.cnf).
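For reference: unique_checks is a dynamic system variable with no matching startup option, which is why mysqld refuses to boot with it in my.cnf. One common workaround (a sketch; the file path is an example) is to run it from an init file, since new sessions inherit the global value:

```ini
# my.cnf: point mysqld at a SQL file executed at startup (example path)
[mysqld]
init_file = /etc/mysql/startup.sql

# /etc/mysql/startup.sql would then contain:
#   SET GLOBAL unique_checks = 0;
```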

The SQL statements are grouped in batches of 15,000. A PHP script generates the SQL statements and sends the query to the MySQL server via mysqli::multi_query:

<?php
$cntBatch = 0;
$QueryBatch = array();

foreach (generateCombinations($Arr) as $c) {

    $QueryBatch[] = "insert into t1 values (NULL"
        . ", " . $record->id
        . ", " . rand(1, 35)
        . ", " . rand(1, 140)
        . ", " . rand(1, 20)
        . ", NULL"
        . ", " . rand(1, 14)
        . ", " . rand(1, 300)
        . ", " . rand(1, 4)
        . ", NULL );";
    $QueryBatch[] = "SET @t1id = LAST_INSERT_ID();";
    $cntBatch++;

    $pquery = array();
    foreach ($c as $key => $pid) {
        if (is_null($pid)) {
            continue;
        }
        $pquery[] = "(NULL, @t1id, " . $pid . ", " . rand(1, 800) . ", 0)";
        $cntBatch++;
    }

    $QueryBatch[] = "insert into t2 values " . implode(',', $pquery) . ";";

    if ($cntBatch > 15000) {
        $query = implode($QueryBatch);

        if ($mysqli->multi_query($query)) {
            while ($mysqli->next_result()) {;}
        } else {
            printf("Errormessage: %s\n", $mysqli->error);
            echo $query . "\n";
        }

        $cntBatch = 0;
        $QueryBatch = array();
    }
}
// Note: any partial batch left over after the loop still needs one final flush.
?>


Example of SQL insert statement:

insert into t1 values (NULL, 1 , 30, 100, 15, NULL, 10, 250, 2, NULL );
SET @t1id = LAST_INSERT_ID();
insert into t2 values (NULL, @t1id, 1, 750, 0),(NULL, @t1id, 1, 600, 0),(NULL, @t1id, 1, 500, 0),(NULL, @t1id, 1, 400, 0),(NULL, @t1id, 1, 300, 0),(NULL, @t1id, 1, 200, 0),(NULL, @t1id, 1, 100, 0);
insert into t1 values (NULL, 2 , 25, 95, 14, NULL, 11, 200, 3, NULL );
SET @t1id = LAST_INSERT_ID();
insert into t2 values (NULL, @t1id, 1, 600, 0),(NULL, @t1id, 1, 100, 0),(NULL, @t1id, 1, 300, 0),(NULL, @t1id, 1, 443, 0),(NULL, @t1id, 1, 521, 0),(NULL, @t1id, 1, 213, 0),(NULL, @t1id, 1, 433, 0);
[.. At least 14982 more..]

Answer

If it were me, I would cut down the number of statements being executed, and cut down the number of commits. I'm assuming that autocommit is enabled, given that we don't see any START TRANSACTION or COMMIT statements.

That's a whole bloatload of individual INSERT and SET statements. At least the inserts to the child table use multi-row inserts rather than a separate INSERT statement for each row.

If I needed this to be fast, I would

  1. generate the id values for the t1 table, and include those in the INSERT statement
  2. do away with the call to LAST_INSERT_ID()
  3. use multi-row insert for t1 (rather than separate INSERT statement for each row)
  4. use START TRANSACTION and COMMIT
  5. run a single process to perform inserts into t1 (serialize) to avoid potential contention for locks
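Sketched in Python (the asker's script is PHP; this is only to show the shape of the batching, and `build_batches`, its parameters, and the input row format are made-up names, not anyone's actual API):

```python
def build_batches(rows, next_id, rows_per_insert=10, rows_per_commit=10000):
    """Pre-assign t1 ids so child rows can reference them without
    LAST_INSERT_ID(), emit multi-row INSERTs, and wrap the work in
    explicit transactions.

    rows:    iterable of (t1_values, [t2_values, ...]) tuples
    next_id: starting id, e.g. from SELECT MAX(id) FROM t1 on a real run
    Yields SQL statements as strings.
    """
    t1_vals, t2_vals = [], []
    pending = 0
    yield "BEGIN"
    for t1_row, t2_rows in rows:
        # id is generated here, not by AUTO_INCREMENT
        t1_vals.append("(%d, %s)" % (next_id, ", ".join(map(str, t1_row))))
        for r in t2_rows:
            t2_vals.append("(NULL, %d, %s)" % (next_id, ", ".join(map(str, r))))
        next_id += 1
        pending += 1
        if len(t1_vals) >= rows_per_insert:
            # one multi-row INSERT per table instead of one statement per row
            yield "INSERT INTO t1 VALUES " + ",".join(t1_vals)
            yield "INSERT INTO t2 VALUES " + ",".join(t2_vals)
            t1_vals, t2_vals = [], []
        if pending >= rows_per_commit:
            yield "COMMIT"
            yield "BEGIN"
            pending = 0
    if t1_vals:
        yield "INSERT INTO t1 VALUES " + ",".join(t1_vals)
        yield "INSERT INTO t2 VALUES " + ",".join(t2_vals)
    yield "COMMIT"
```

The same restructuring would apply directly inside the existing PHP loop: assign ids from a counter, accumulate VALUES tuples, and flush both tables per batch inside one transaction.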

If it were for InnoDB, I'd also do SET FOREIGN_KEY_CHECKS=0.

There's already a boatload of calls to the rand function in the code, so incrementing an integer id for t1 isn't going to move the needle. When we start, we'd need a query to get the current AUTO_INCREMENT value, or the MAX(id), whichever.
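Either starting point is a single query run once before the load begins:

```sql
-- simplest: highest id currently in the table
SELECT MAX(id) FROM t1;

-- or the table's next AUTO_INCREMENT value
SELECT AUTO_INCREMENT
  FROM information_schema.TABLES
 WHERE TABLE_SCHEMA = DATABASE()
   AND TABLE_NAME = 't1';
```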

Basically, I'd cut down the number of statements being executed, get more work done by each statement, and get more work done before each COMMIT.

Inserting ten (10) t1 rows per statement would significantly reduce the number of statements that need to be executed.

START TRANSACTION;
-- insert ten rows into t1
INSERT INTO t1 (id,m_id,c1,c2,c3,c4,c5,c6,c7,factor) VALUES
 (444055501, 1 , 30, 100, 15, NULL, 10, 250, 2, NULL )
,(444055502, 2 , 25, 95, 14, NULL, 11, 200, 3, NULL )
, ...
,(444055510, 10 , 7, 45, 12, NULL, 10, 300, 4, NULL )
;
-- batch together the t2 rows associated with the ten t1 rows we just inserted
INSERT INTO t2 VALUES
-- 444055501  
 (NULL, 444055501, 1, 750, 0)
,(NULL, 444055501, 1, 600, 0)
,(NULL, 444055501, 1, 500, 0)
,(NULL, 444055501, 1, 400, 0)
,(NULL, 444055501, 1, 300, 0)
,(NULL, 444055501, 1, 200, 0)
,(NULL, 444055501, 1, 100, 0)
-- 444055502  
,(NULL, 444055502, 1, 600, 0)
,(NULL, 444055502, 1, 100, 0)
,(NULL, 444055502, 1, 300, 0)
,(NULL, 444055502, 1, 443, 0)
,(NULL, 444055502, 1, 521, 0)
,(NULL, 444055502, 1, 213, 0)
,(NULL, 444055502, 1, 433, 0)
-- 444055503
, ...
;

-- another ten rows into t1
INSERT INTO t1 (id,m_id,c1,c2,c3,c4,c5,c6,c7,factor) VALUES
 (444055511, 11 , 27, 94, 15, NULL, 10, 250, 11, NULL )
,(444055512, 12 , 24, 93, 14, NULL, 11, 200, 12, NULL )
, ...
,(444055520, 10 , 7, 45, 12, NULL, 10, 300, 4, NULL )
;
INSERT INTO t2 VALUES
 (NULL, 444055511, 1, 820, 0)
,(NULL, 444055511, 1, 480, 0)
, ...
;

-- repeat INSERTs into t1 and t2, and after 1000 loops
-- i.e. 10,000 t1 rows, do a commit
COMMIT;
START TRANSACTION;
INSERT INTO t1 ...

LOAD DATA INFILE

Any discussion of performance of inserts would be incomplete without at least mentioning LOAD DATA INFILE.

For best performance, that can't be beat. But since we don't have the data in a file, we don't have key values (needed for the foreign key in t2), and we've got all those calls to rand to generate the data, LOAD DATA INFILE doesn't seem to be a good fit.
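For completeness, if the rows were staged to a tab-delimited file first (with ids pre-assigned as above, so t2 could be staged the same way), the load itself would be just a sketch like this (the file path is an example):

```sql
-- assumes /tmp/t1.tsv holds tab-separated columns matching t1's layout
LOAD DATA INFILE '/tmp/t1.tsv'
INTO TABLE t1
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
```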