ZedBrannigan - 1 year ago
Java Question

Using Talend to extract HTML Search Pages into .txt files based on input keywords. How can I parse this data End-to-End and write it to MySQL?

To add to the title: I now have a working workflow consisting of two steps.

1) I extract the HTML search result pages for every keyword given in an input.txt file, e.g.:

Business Intelligence;

Talend saves those results and writes them as HTML to keywords_Business Intelligence.txt. Attached is an image of the Talend job.
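For readers following along, here is a minimal sketch of the naming convention described in step 1, written as plain Java. It assumes that every entry in input.txt ends with a semicolon and that each output file is named "keywords_" plus the keyword plus ".txt", as in the example above. The class name KeywordFiles, its method names, and the second keyword "Data Warehouse" are made up for illustration; the actual Talend job may build the names differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Talend job: it only mirrors the
// naming convention described in the question.
final class KeywordFiles {

    // Splits the raw content of input.txt into keywords.
    // Assumption: entries are terminated by ';' and may sit on separate lines.
    static List<String> parseKeywords(String inputText) {
        List<String> keywords = new ArrayList<>();
        for (String part : inputText.split(";")) {
            String keyword = part.trim();
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }
        return keywords;
    }

    // Builds the name of the result file for one keyword.
    static String fileNameFor(String keyword) {
        return "keywords_" + keyword + ".txt";
    }

    public static void main(String[] args) {
        // "Data Warehouse" is an invented second keyword for the demo.
        String inputText = "Business Intelligence;\nData Warehouse;\n";
        for (String keyword : parseKeywords(inputText)) {
            System.out.println(fileNameFor(keyword));
        }
    }
}
```

The same two helpers could be reused in step 2 to find the file that belongs to each keyword, so that both steps agree on the file names.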

Talend Workflow

2) I use Java code to import these files one by one and parse the data out of the DOM structure using the Jsoup library. The data is then written straight into a MySQL database.

Here is my problem: It all works fine for now, but the requirement is to completely automate the process in the future, so it can run on a server periodically.

Therefore I thought I would include my Java code in Talend, which got me stuck because I wasn't able to import the MySQL connector and jsoup.jar.

Where I need your help: either advise me how to connect my Java code to my existing Talend workflow, or perhaps you know of an easier solution that I'm just not thinking of right now.

I have to add, I'm quite new to coding, and it was a big leap to come this far with parsing and writing into a DB. With your help throughout the process, I got more comfortable though. I hope you can help me solve this problem. Thank you in advance for your time spent.

Answer Source

This can be done by using the tLibraryLoad component and putting the external jar file in <talendInstallDir>/lib/java.

You can use the OnSubjobOk or OnComponentOk connections to connect to the next components.

Your tLibraryLoad component(s) should be the first thing you run in your job.
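If you want to confirm that the jars really are visible before the parsing code runs, one option is a small classpath check. This is only a sketch in plain Java: the class ClasspathCheck and its method isAvailable are invented for illustration, and the MySQL driver class name depends on your Connector/J version (com.mysql.jdbc.Driver in older releases, com.mysql.cj.jdbc.Driver in newer ones), so adjust it to your jar.

```java
// Hypothetical helper: reports whether required classes can be loaded.
final class ClasspathCheck {

    // Returns true if the class with the given fully qualified name
    // can be found and loaded from the current classpath.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names: org.jsoup.Jsoup from the jsoup jar and the
        // driver class of a recent MySQL Connector/J release.
        String[] required = {"org.jsoup.Jsoup", "com.mysql.cj.jdbc.Driver"};
        for (String className : required) {
            String status = isAvailable(className) ? "found" : "MISSING";
            System.out.println(className + ": " + status);
        }
    }
}
```

Inside a tJava component you would paste only the statements, not the class declaration, since the component body is already placed inside a generated method. Link it after the tLibraryLoad components so the check runs once the jars have been loaded.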

You can also import classes in tJava and tJavaRow under Advanced settings in the Component view, using a statement such as:

import org.apache.commons.lang3.math.NumberUtils;

to import the specific class you need (in this case, the Apache Commons NumberUtils).