deterjan deterjan - 4 months ago 15
HTML Question

Selecting tags by content in Jsoup and getting nth tag after the given tag

I have an HTML document I want to scrape data from. The tag holding the data has no unique identifier, except that it is the 13th <td> tag after the <td> tag containing a given string.

So, for example, the 10th <td> tag in the document contains the word "dog" (i.e. <td>dog</td>), and no other <td> tag in the document contains identical data. Given only the word "dog", is it possible to extract the content of the 23rd <td> tag in the document using Jsoup methods, and if so, how?

Edit:

<td>Cat</td>
<td align="center">40</td>
<td align="center">67</td>
<td align="center">58<br>0</td>
<td align="center">32</td>
<td>Dog</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">58<br>0</td>
<td align="center">99</td>
<td>Snake</td>
<td align="center">7</td>
<td align="center">85</td>
<td align="center">58<br>0</td>
<td align="center">13</td>


In a document like this, given only the animal's name, I would like to extract the number in the nth <td> tag after it, say n = 4. So given "Cat" I would like to find 32; given "Dog", 99; and given "Snake", 13. Assume there are hundreds of animals in the document.

Answer

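You can use structural pseudo-selectors to target the nth element. Note that :nth-child is 1-based and counts among the children of the same parent, so the selector below matches any <td> that is the 23rd child of its row, not the 23rd <td> in the document.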

doc.select("td:nth-child(23)");

Since you are looking for the row with Dog, you could select that row first.

Element dogRow = doc.select("tr:has(td:contains(dog))").first();

and then select the 23rd child

String cellValue = dogRow.select("td:nth-child(23)").first().ownText();

or combine them

String cellValue = doc
    .select("tr:has(td:contains(dog)) > td:nth-child(23)")
    .first()
    .ownText();
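As a rough illustration of the row-based selectors above, here is a small runnable sketch. It assumes Jsoup is on the classpath and that each animal sits in its own <tr>, which the question does not actually show; the class name RowLookup, the method cellInRow, and the SAMPLE markup are made up for this example. Keep in mind that :contains does a case-insensitive substring match, so "dog" would also match a cell such as "Dogfish".

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RowLookup {

    // Assumed layout: one <tr> per animal (attributes omitted for brevity).
    static final String SAMPLE =
        "<table>"
        + "<tr><td>Cat</td><td>40</td><td>67</td><td>58<br>0</td><td>32</td></tr>"
        + "<tr><td>Dog</td><td>0</td><td>0</td><td>58<br>0</td><td>99</td></tr>"
        + "<tr><td>Snake</td><td>7</td><td>85</td><td>58<br>0</td><td>13</td></tr>"
        + "</table>";

    /**
     * Returns the text of the cell at the given 1-based position inside the
     * first row that has a cell containing `label`, or null if nothing matches.
     * The label is pasted into the selector as-is, so it must not contain
     * parentheses or other selector syntax.
     */
    static String cellInRow(Document doc, String label, int position) {
        String query = "tr:has(td:contains(" + label + ")) > td:nth-child(" + position + ")";
        Element cell = doc.select(query).first();
        return cell == null ? null : cell.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE);
        System.out.println(cellInRow(doc, "dog", 5));   // prints 99
        System.out.println(cellInRow(doc, "Cat", 5));   // prints 32
    }
}
```

Because the name is in the first cell of each row here, "4 cells after the name" corresponds to td:nth-child(5).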

Edit

I reread your question, and it seems you want to find Dog within a row and then find the nth sibling after it.

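You could use elementSiblingIndex() to find the position of the Dog cell and child() to fetch the cell a fixed number of positions after it. (getElementsByIndexEquals can also work in simple cases, but it searches the row itself and all of its descendants, so it may return more than one element.)

    Element dogRow = doc.select("tr:has(td:contains(dog))").first();

    // 0-based position of the Dog cell among its sibling elements
    int dogCellIndex = dogRow
        .select("td:contains(dog)")
        .first()
        .elementSiblingIndex();

    // offset from the Dog cell: 13 in the original question, 4 in the edited sample
    int otherCellIndex = dogCellIndex + 13;

    // child() is also 0-based and only looks at direct children of the row
    String cellValue = dogRow
        .child(otherCellIndex)
        .text();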
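If the cells are not grouped one animal per row (the sample in the question does not show the surrounding <tr> tags), a sketch that works on document order may be safer. It assumes Jsoup is on the classpath; the class name NthCellAfter and the method cellAfter are invented for illustration, and the <table><tr> wrapper in SAMPLE is a guess at the real markup. It compares the cell's own text for equality, ignoring case, rather than using :contains, so "Dog" would not match "Dogfish".

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class NthCellAfter {

    // Cells from the question, all in one row (wrapper and layout are assumed;
    // attributes omitted for brevity).
    static final String SAMPLE =
        "<table><tr>"
        + "<td>Cat</td><td>40</td><td>67</td><td>58<br>0</td><td>32</td>"
        + "<td>Dog</td><td>0</td><td>0</td><td>58<br>0</td><td>99</td>"
        + "<td>Snake</td><td>7</td><td>85</td><td>58<br>0</td><td>13</td>"
        + "</tr></table>";

    /**
     * Finds the first <td> whose own text equals `label` (ignoring case) and
     * returns the text of the <td> that comes `offset` cells later in document
     * order. Returns null if the label is not found or the offset runs past
     * the last cell.
     */
    static String cellAfter(Document doc, String label, int offset) {
        Elements cells = doc.select("td"); // every cell, in document order
        for (int i = 0; i < cells.size(); i++) {
            if (cells.get(i).ownText().trim().equalsIgnoreCase(label)) {
                int target = i + offset;
                return target < cells.size() ? cells.get(target).text() : null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE);
        System.out.println(cellAfter(doc, "Cat", 4));   // prints 32
        System.out.println(cellAfter(doc, "Dog", 4));   // prints 99
        System.out.println(cellAfter(doc, "snake", 4)); // prints 13
    }
}
```

For the original wording of the question, the call would be cellAfter(doc, "dog", 13). Since the loop scans every cell once per lookup, a document with hundreds of animals and many lookups might be better served by building a map from name to index first.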