
Pentaho Table Input to Table Output in batches

My input table in MySQL has 20 million records and the target table in Oracle is empty. I need to load the entire table from MySQL into Oracle. I am simply using a Table Input and a Table Output step.

(screenshot: a transformation with a Table Input step connected to a Table Output step)

My intention is to avoid locking the source table for a long time while reading.
Is there a problem with the volume (number of records) I am trying to load?
I can see a "Use batch update for inserts" option in the Table Output step, but nothing similar in the Table Input. Is there a way to perform batch processing in Pentaho?

Answer

Don't worry, 20 million records is a small number for PDI, and reading will not lock the table: a Table Input is just a plain SELECT. That is why the batch/bulk options exist for output tables, not input tables.

A common beginner trap, however, is the Truncate table option on the output step. If you run the output step twice (inadvertently, or for parallel processing), each copy will lock the other. Forever.

To speed up: you may tick the Lazy conversion check box on the input, so that the data stays in byte format until it is actually used. I am not sure you gain much on a simple input/output transformation, though, and if something goes wrong with Dates or Blobs while writing the output, the error message will be quite cryptic.

You can also increase the speed of the output by increasing the commit size (worth a few trials against Oracle), and by increasing the Nr of rows in rowset, which also increases the number of rows read ahead by the Table Input. To do so, right-click anywhere on the canvas, then Properties / Miscellaneous.
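
For reference, both settings end up in the saved transformation file, so you can also check them there. A minimal sketch of the relevant .ktr fragments, assuming a typical PDI file layout (the values shown are just starting points to tune, not recommendations):

```xml
<!-- Transformation settings (Properties / Miscellaneous): the number of
     rows buffered between steps, stored in the <info> section. -->
<info>
  <size_rowset>50000</size_rowset>
</info>

<!-- Table Output step: commit size and batch inserts. -->
<step>
  <name>Table Output</name>
  <type>TableOutput</type>
  <commit>5000</commit>      <!-- rows per commit; worth a few trials -->
  <use_batch>Y</use_batch>   <!-- the "Use batch update for inserts" box -->
  <truncate>N</truncate>     <!-- keep off if the step may run twice -->
</step>
```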

Something I really advise is to increase the JVM memory size. Use an editor (Notepad or better) to edit the file named spoon.bat. Around line 94-96 you will find a line containing something like "-Xmx256K". Change it to "-Xmx4096M" (where 4096 MB is about half your machine's RAM).
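
A sketch of that edit, assuming the classic spoon.bat layout (the exact line number and default value vary between PDI versions, so match on the -Xmx option rather than the line number):

```bat
REM Before: the default heap, far too small for a 20M-row load.
set OPT=-Xmx256K -cp %CLASSPATH% ...

REM After: give the JVM roughly half the machine's RAM.
set OPT=-Xmx4096M -cp %CLASSPATH% ...
```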

To perform "batch processing" has many meaning. One of them beeing Make the transformation database transactional. Which you can do with the check box just below the above mentionned Number of row in rowset (and buggily spelled as Make the transformation database in PDI latest version). With that box checked, if something goes wrong the state of the databases is rolled back as if the transformation was never executed. But I don't advise to do this in your case.
