Ezio Ezio - 1 month ago 6
HTML Question

How to access inner tags token in golang?

I am making a web scraper and I have never done it before, so please point out if I am doing anything wrong.

I am using Go to scrape.

Suppose I have been given a table:

<table>
<tr>
<td>XYZ<td>
<td>XYZ<td>
<td>XYZ<td>
<tr>
<tr>
<td>XYZ<td>
<td>XYZ<td>
<td>XYZ<td>
<tr>
<tr>
<td>XYZ<td>
<td>XYZ<td>
<td>XYZ<td>
<tr>
<tr>
<td>XYZ<td>
<td>XYZ<td>
<td>XYZ<td>
<tr>
</table>


I want to extract data from each tr, but only from the second td.

Also, can I return a new HTML string that contains only the content inside the table tag, with everything else in the HTML outside the table tag removed?

Answer

Well, first of all, your HTML example is wrong: you missed all the closing tags, </tr> and </td>.

For this kind of job it is usually better to use some sort of DOM selectors, like jQuery does. For Go I recommend goquery; it's a small library and works pretty well. Your solution:

package main

import (
    "log"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    doc, err := goquery.NewDocument("http://your.url.com/foo.html")
    if err != nil {
        log.Fatal(err)
    }

    doc.Find("table tr").Each(func(_ int, tr *goquery.Selection) {

        // for each <tr> found, find the <td>s inside
        // ix is the index
        tr.Find("td").Each(func(ix int, td *goquery.Selection) {

            // print only the td number 2 (index == 1)
            if ix == 1 {
                log.Printf("index: %d content: '%s'", ix, td.Text())
            }
        })
    })
}

As you may note, td.Text() returns the content of each td tag. I left you the full file that I used for testing: https://play.golang.org/p/Rtb1Tqz1Wb