Geoffrey THIESSET - 3 months ago
Scala Question

Huge insert to HBase

I have an issue when I try to insert data to HBase.

I have a 12-million-row Spark DataFrame with 2 fields:

* KEY, an MD5 hash
* MATCH, a boolean ("1" or "0")


I need to store it in an HBase table, with KEY as the rowkey and MATCH as a column.

I created the table pre-split on the rowkey:

create 'GTH_TEST', 'GTH_TEST', {SPLITS=> ['10000000000000000000000000000000',
'20000000000000000000000000000000','30000000000000000000000000000000',
'40000000000000000000000000000000','50000000000000000000000000000000',
'60000000000000000000000000000000','70000000000000000000000000000000',
'80000000000000000000000000000000','90000000000000000000000000000000',
'a0000000000000000000000000000000','b0000000000000000000000000000000',
'c0000000000000000000000000000000','d0000000000000000000000000000000',
'e0000000000000000000000000000000','f0000000000000000000000000000000']}
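The 15 split points above follow a simple pattern: one boundary per leading hex digit 1 through f, padded to the 32-character width of an MD5 hash, giving 16 evenly sized regions. As a sanity check, they can be generated programmatically (plain Scala, no Spark or HBase needed):

```scala
// Generate the 15 split points for an MD5-keyed table: each boundary is a
// single hex digit (1..f) padded with '0' to the 32-char MD5 width.
val splits: Seq[String] = (1 to 15).map(i => Integer.toHexString(i).padTo(32, '0'))
println(splits.mkString(",\n"))
```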


I use the shc HBase connector from Hortonworks like this:

df.write
.options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))
.format("org.apache.spark.sql.execution.datasources.hbase")
.save()
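The `cat_matrice` catalog is not shown in the question. For reference, an shc catalog for this KEY/MATCH schema would typically look like the following (the `default` namespace and the `string` types here are assumptions):

```scala
// Hypothetical shc catalog for GTH_TEST: KEY maps to the rowkey,
// MATCH to a column in the GTH_TEST column family.
val cat_matrice = """{
  |"table":{"namespace":"default", "name":"GTH_TEST"},
  |"rowkey":"key",
  |"columns":{
    |"KEY":{"cf":"rowkey", "col":"key", "type":"string"},
    |"MATCH":{"cf":"GTH_TEST", "col":"MATCH", "type":"string"}
  |}
|}""".stripMargin
```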


This code never finishes. It starts inserting data into HBase and runs forever (at least 35 hours before I killed it). It completes 11984 of 16000 tasks and then stalls, always at the same task count.

I made a single change:

df.limit(Int.MaxValue)
.write
.options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))
.format("org.apache.spark.sql.execution.datasources.hbase")
.save()


With the limit(Int.MaxValue), it takes 4 to 5 minutes to insert the 12 million lines.
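A plausible factor (not confirmed in the question) is that limit() changes the physical plan and collapses the write into far fewer partitions than the original 16000 tasks, so far fewer concurrent writers hit HBase. If that is the cause, the same effect could be produced explicitly with coalesce(); the partition count below is an illustrative assumption:

```scala
// Sketch: cap write parallelism directly instead of relying on
// limit()'s side effect. 15 here is illustrative, e.g. roughly one
// task per pre-split HBase region.
df.coalesce(15)
  .write
  .options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```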

Can somebody explain this behaviour? Is there a maximum number of connections on the HBase side?
Is there some tuning to do on the HBase or Spark side?

Thanks !

Geoffrey

Answer

We finally changed the HBase connector.

With it.nerdammer.spark.hbase (which writes via RDDs), it works perfectly.

import it.nerdammer.spark.hbase._

// The connector expects spark.hbase.host to be set in the SparkConf and an
// implicit SparkContext in scope; rdd holds (rowkey, value) tuples, e.g.:
val rdd = df.rdd.map(r => (r.getString(0), r.getString(1)))  // (KEY, MATCH)
rdd.toHBaseTable(tableName)
   .toColumns("MATCHED")
   .inColumnFamily(cfName)
   .save()