Geoffrey THIESSET - 1 year ago
Scala Question

Huge insert to HBase

I have an issue when I try to insert data into HBase.

I have a Spark DataFrame of 12 million rows with 2 fields:

* KEY, an MD5 hash
* MATCH, a boolean flag stored as "1" or "0"

I need to store it in an HBase table, with KEY as the rowkey and MATCH as a column.

I created the table pre-split on the rowkey:

create 'GTH_TEST', 'GTH_TEST', {SPLITS=> ['10000000000000000000000000000000',
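The create command above is cut off after its first split point, so the full list of splits is not known. As an illustration only, the sketch below generates one split point per leading hex digit, which is one common way to pre-split a table keyed by MD5 hex digests. The object `SplitKeys`, its helpers, the choice of 15 split points, and the assumption that the keys are lowercase hex are all hypothetical and not taken from the question.

```scala
object SplitKeys {
  // Hypothetical helper: one split point per leading hex digit 1..f,
  // each padded with zeros to the length of an MD5 hex digest (32 chars).
  // Assumes lowercase hex keys; uppercase digits sort differently in HBase.
  def hexSplits(keyLength: Int = 32): Seq[String] =
    "123456789abcdef".map(c => c.toString + "0" * (keyLength - 1))

  // Renders the split points as the array literal expected by the HBase shell.
  def shellArray(splits: Seq[String]): String =
    splits.map(s => s"'$s'").mkString("[", ", ", "]")

  def main(args: Array[String]): Unit =
    println(shellArray(hexSplits()))
}
```

The first generated value has the same shape as the split point visible in the command above. The printed array could be pasted after `SPLITS =>` in a shell `create` statement, giving 16 regions.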

I write the DataFrame with the HBase shc connector from Hortonworks like this:

.options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))
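The contents of `cat_matrice` are not shown in the question. For readers unfamiliar with shc, a catalog for this table might look roughly like the sketch below. It follows the JSON layout documented for shc, but the `default` namespace, the `string` types, and the reuse of `GTH_TEST` as the column family name are assumptions based on the description and the create command, not the author's actual catalog.

```scala
object MatchCatalog {
  // Assumed layout: table and column family are both named GTH_TEST,
  // KEY maps to the row key, and MATCH is stored as a string ("1" or "0").
  val catalog: String =
    """{
      |  "table": {"namespace": "default", "name": "GTH_TEST"},
      |  "rowkey": "key",
      |  "columns": {
      |    "KEY":   {"cf": "rowkey",   "col": "key",   "type": "string"},
      |    "MATCH": {"cf": "GTH_TEST", "col": "MATCH", "type": "string"}
      |  }
      |}""".stripMargin
}
```

A string like `MatchCatalog.catalog` is what would be passed as the value for `HBaseTableCatalog.tableCatalog` in the `.options(...)` call shown above.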

This code never finishes. It starts inserting data into HBase and then runs forever (at least 35 hours before I killed it). It always stalls at the same point, after completing 11984 of 16000 tasks.

I made a single change, adding limit(Int.MaxValue) to the DataFrame before the write:

.options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))

With limit(Int.MaxValue), it takes 4 to 5 minutes to insert the 12 million rows.

Can somebody explain this behaviour? Is there a maximum number of connections on the HBase side?
Is there some tuning to do on the HBase or Spark side?

Thanks!


Answer Source

We ended up switching to a different HBase connector.

With it.nerdammer.spark.hbase (which works on RDDs rather than DataFrames), it works perfectly.

import it.nerdammer.spark.hbase._
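The rest of the answer's code is not shown. As a rough sketch of what a write with this connector can look like, the example below saves (KEY, MATCH) pairs using the connector's `toHBaseTable` / `toColumns` / `inColumnFamily` / `save` chain, where the first tuple element is used as the row key. The object `SaveMatches`, the string types, and the use of `GTH_TEST` as the column family are assumptions for illustration, not the author's actual code.

```scala
import org.apache.spark.rdd.RDD
import it.nerdammer.spark.hbase._

object SaveMatches {
  // Hypothetical helper: writes (KEY, MATCH) pairs to the GTH_TEST table.
  // The connector uses the first tuple element as the row key and stores the
  // remaining elements in the listed columns of the given column family.
  def save(pairs: RDD[(String, String)]): Unit =
    pairs
      .toHBaseTable("GTH_TEST")
      .toColumns("MATCH")
      .inColumnFamily("GTH_TEST")
      .save()
}
```

Assuming both DataFrame columns are strings, it could be called as `SaveMatches.save(df.rdd.map(r => (r.getString(0), r.getString(1))))`. The connector also expects the HBase host to be set in the Spark configuration (the `spark.hbase.host` property).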
