Justin Chiu - 5 months ago
Python Question

How can I get the information in the table using BeautifulSoup?

I'm trying to get the information in the table from this website: http://indiawater.gov.in/IMISReports/Reports/WaterQuality/rpt_WQM_LaboratoryInformation_S.aspx?Rep=0&RP=Y

When I inspect the page in the browser, the data is in td elements with the classes oddrowcolor and evenrowcolor. However, when I try to extract it, nothing is printed. How can I get the information in the table using BeautifulSoup for Python?

Below is my code:

import requests
from bs4 import BeautifulSoup
url = "http://indiawater.gov.in/IMISReports/Reports/WaterQuality/rpt_WQM_LaboratoryInformation_S.aspx?Rep=0&RP=Y"
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

for tr in soup.find_all('tr', {'class': 'oddrowcolor'}):
    print(tr)


I tried with oddrowcolor, but nothing was printed.

Answer

You can use the table id to get the table. The oddrowcolor and evenrowcolor classes are added dynamically in the browser, so they are not in the HTML source that requests downloads:

import requests
from bs4 import BeautifulSoup
url = "http://indiawater.gov.in/IMISReports/Reports/WaterQuality/rpt_WQM_LaboratoryInformation_S.aspx?Rep=0&RP=Y"
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")
table = soup.select_one("#tableReportTable")

for tr in table.find_all("tr"):
    print(tr)
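
One way to confirm that the classes are missing from the downloaded HTML is to search the raw text for them. The sketch below is only a crude substring check, and both the helper name names_in_source and the sample markup are made up for illustration:

```python
def names_in_source(html, names):
    """Map each name to True if it occurs anywhere in the raw HTML text."""
    return {name: name in html for name in names}


# Hypothetical stand-in for r.text: the table has an id,
# but its rows carry no class attribute.
sample_html = '<table id="tableReportTable"><tr><td>GOA</td></tr></table>'

found = names_in_source(
    sample_html, ["oddrowcolor", "evenrowcolor", "tableReportTable"]
)
print(found)
```

With the real response, pass r.text in place of sample_html. If the class names come back False while the id comes back True, selecting by id is the safer route.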

To pull the table data, you can do something like:

soup = BeautifulSoup(r.content, "html.parser")
table = soup.select_one("#tableReportTable")
# column names
print(", ".join([th.text.strip() for th in table.select_one("tr").find_all("th")]))

for tr in table.select("tr + tr"):
    # get row text from each anchor inside the row tds
    print(",".join([a.text for a in tr.select("td a")]))
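
Joining cells with "," becomes ambiguous as soon as a cell contains a comma itself. If you want to save the rows rather than print them, the standard csv module quotes such cells for you. This is a minimal sketch assuming Python 3; rows_to_csv is a hypothetical helper and the header and rows are invented sample values, not data from the page:

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialise a header row and data rows to CSV text, quoting where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample: the last header cell contains a comma on purpose.
header = ["State", "Block Labs/Total Blocks", "Mobile Labs (State, District)"]
rows = [["GOA", "1 / 11", "0"], ["DELHI", "NA / 0", "0"]]

csv_text = rows_to_csv(header, rows)
print(csv_text)
```

For the real table, the rows could be built as lists instead of joined strings, for example [[a.text for a in tr.select("td a")] for tr in table.select("tr + tr")].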

Which gives you:

S.No., State, State Labs (without mobile labs), District Labs (without mobile labs), Block Labs/Total Blocks (without mobile labs), SubDivision Labs (without mobile labs), Mobile Labs (State/ District/ Block/ Sub-division Level), Total Labs   (State/ District/ Block/ Sub-division Level)

ANDAMAN and NICOBAR,1,0,NA / 9,0,2,3
ANDHRA PRADESH,1,32,NA / 662,73,0,106
ARUNACHAL PRADESH,1,17,NA / 100,31,0,49
ASSAM,1,29,NA / 242,53,20,103
BIHAR,1,41,NA / 536,0,0,42
CHANDIGARH,0,0,NA / 1,0,0,0
CHATTISGARH,1,27,NA / 146,20,5,53
DADRA & NAGAR HAVELI,0,0,NA / 10,0,0,0
DAMAN & DIU,0,0,NA / 1,0,0,0
DELHI,0,0,NA / 0,0,0,0
GOA,1,0,1 / 11,9,0,11
GUJARAT,1,34,50 / 246,0,6,91
HARYANA,0,21,NA / 126,21,0,42
HIMACHAL PRADESH,1,14,NA / 77,28,0,43
JAMMU AND KASHMIR,0,22,2 / 148,74,0,98
JHARKHAND,1,24,NA / 259,3,5,33
KARNATAKA,1,44,39 / 176,106,46,236
KERALA,1,14,NA / 148,33,0,48
LAKSHADWEEP,0,9,NA / 9,0,0,9
MADHYA PRADESH,1,51,3 / 313,106,0,161
MAHARASHTRA,1,44,2 / 351,139,0,186
MANIPUR,1,9,NA / 38,2,0,12
MEGHALAYA,1,7,NA / 42,22,0,30
MIZORAM,1,8,NA / 26,18,0,27
NAGALAND,0,11,NA / 74,1,2,14
ODISHA,1,32,NA / 314,42,0,75
PUDUCHERRY,0,2,NA / 3,0,0,2
PUNJAB,3,22,8 / 145,0,1,34
RAJASTHAN,1,33,163 / 295,0,0,197
SIKKIM,0,2,NA / 9,0,0,2
TAMIL NADU,1,34,NA / 385,49,0,84
TELANGANA,1,19,NA / 438,56,0,76
TRIPURA,1,8,7 / 58,6,0,22
UTTAR PRADESH,1,76,3 / 820,2,0,82
UTTARAKHAND,0,28,1 / 95,14,0,43
WEST BENGAL,1,18,NA / 341,201,0,220

That seems to match what I see in the browser. The Total row is in th tags inside the last tr, so you can print it by adding the following after the loop, where tr still refers to the last row:

print(",".join([th.text.strip() for th in tr.select("th")]))

Which would give you:

Total,27,732,279,1109,87,2234
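
If you want to experiment without hitting the site, the three steps above (header, data rows, total row) can be tried on a small inline table. The markup below is an invented approximation of the page's structure, and extract_table is a hypothetical helper, so treat this as a sketch rather than a drop-in solution:

```python
from bs4 import BeautifulSoup

# Invented sample that mimics the assumed layout: a th header row,
# data rows whose values sit inside anchors, and a th total row.
SAMPLE_HTML = """
<table id="tableReportTable">
<tr><th>S.No.</th><th>State</th><th>State Labs</th></tr>
<tr><td>1</td><td><a href="#">GOA</a></td><td><a href="#">1</a></td></tr>
<tr><td>2</td><td><a href="#">DELHI</a></td><td><a href="#">0</a></td></tr>
<tr><th colspan="2">Total</th><th>1</th></tr>
</table>
"""


def extract_table(html, table_id="tableReportTable"):
    """Return (header, body, total) as lists of stripped cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("#" + table_id)
    if table is None:
        raise ValueError("no table with id %r in the HTML" % table_id)
    rows = table.find_all("tr")
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    body = []
    for tr in rows[1:]:
        cells = [a.get_text(strip=True) for a in tr.select("td a")]
        if cells:  # the total row has no td anchors, so it is skipped here
            body.append(cells)
    total = [th.get_text(strip=True) for th in rows[-1].find_all("th")]
    return header, body, total


header, body, total = extract_table(SAMPLE_HTML)
print(header)
print(body)
print(total)
```

Unlike the loop above, this version skips rows that yield no anchor text, so the total row does not produce an empty line. For the live page, pass r.content instead of SAMPLE_HTML.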