Rian Ashwin - 1 year ago
HTML Question

Getting Column Headers from multiple html 'tbody'

I need to get the column headers from the second tbody at this URL.


Specifically, I would like to see "September", "October", etc.

I am getting the following error:

runfile('C:/Python27/Lib/site-packages/xy/workspace/webscrape/mpob1.py', wdir='C:/Python27/Lib/site-packages/xy/workspace/webscrape')
Traceback (most recent call last):

File "<ipython-input-8-ab4005f51fa3>", line 1, in <module>
runfile('C:/Python27/Lib/site-packages/xy/workspace/webscrape/mpob1.py', wdir='C:/Python27/Lib/site-packages/xy/workspace/webscrape')

File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
execfile(filename, namespace)

File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)

File "C:/Python27/Lib/site-packages/xy/workspace/webscrape/mpob1.py", line 26, in <module>
soup.findAll('tbody', limit=2)[1].findAll('tr').findAll('th')]

IndexError: list index out of range

Can anyone here please help me out? I shall be eternally grateful!

I have posted my code below:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://bepi.mpob.gov.my/index.php/statistics/price/daily.html"

r = requests.get(url)

soup = BeautifulSoup(r.text, 'lxml')

column_headers = [th.getText() for th in
                  soup.findAll('tbody', limit=2)[1].findAll('tr').findAll('th')]

Answer Source

When you click the "View Price" button, a POST request is sent to the http://bepi.mpob.gov.my/admin2/price_local_daily_view3.php endpoint. Simulate this POST request and parse the resulting HTML:

import requests
from bs4 import BeautifulSoup

with requests.Session() as session:

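    # form fields submitted by the "View Price" button (tahun = year, bulan = month)
    response = session.post("http://bepi.mpob.gov.my/admin2/price_local_daily_view3.php", data={
        "tahun": "2016",
        "bulan": "9",
        "Submit2222": "View Price"
    })
    soup = BeautifulSoup(response.content, 'lxml')

    # the month names sit in the third row of the results table
    table = soup.find("table", id="hor-zebra")
    headers = [td.get_text() for td in table.find_all("tr")[2].find_all("td")]
    print(headers)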

Prints the headers of the table:

[u'Tarikh', u'September', u'October', u'November', u'December', u'September', u'October', u'November', u'December', u'September', u'October', u'November', u'December']
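
As for the original error: the IndexError means soup.findAll('tbody', limit=2) returned fewer than two elements for the page that was fetched, so indexing with [1] failed. The same expression also calls findAll on the list returned by findAll('tr'), which would fail even if two tbody elements were present. The sketch below shows one way to guard against both problems. It is only an illustration: SAMPLE_HTML and row_cells are hypothetical names, the markup is a made-up stand-in for the real response, and html.parser is used instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML returned by the POST request.
SAMPLE_HTML = """
<table id="hor-zebra">
  <tr><td>Title</td></tr>
  <tr><td>Group</td></tr>
  <tr><td>Tarikh</td><td>September</td><td>October</td></tr>
  <tr><td>1</td><td>2,500</td><td>2,480</td></tr>
</table>
"""


def row_cells(html, table_id, row_index):
    """Return the cell texts of one table row, or [] if the table or row is missing."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    rows = table.find_all("tr")
    # Check the length before indexing to avoid an IndexError.
    if row_index >= len(rows):
        return []
    # Call find_all on a single row (a Tag), not on the list of rows.
    return [cell.get_text(strip=True) for cell in rows[row_index].find_all(["td", "th"])]


print(row_cells(SAMPLE_HTML, "hor-zebra", 2))
```

Returning an empty list when the table or row is absent makes it easy to tell "the page did not contain the data" apart from a parsing bug.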