Rishi Menon - 2 months ago
HTML Question

How do I scrape data from an HTML table with Python and store it in a CSV file? I can extract some parts but not others

I am a beginner in web scraping and have become very interested in the process. I set myself a project that would keep me motivated until I completed it.

My Project

My aim is to write a Python program that goes to my university results page, scrapes the results of a range of students, and stores each student's marks in each subject in a .csv file or a comma-delimited text file. I have the code working to submit the POST request to the .asp page. I would appreciate it if you could guide me on how to store the subject-wise details in separate columns like:

Desired Output:

Sl.no,Name,Subject1,Subject2,Subject3,Subject4,Subject5,Subject6,..etc
1,Jason,8,9,8,8,8,9..etc
2,Peter,6,8,9,8,7,7..etc
.
.
.

for a series of exam numbers.

Some Sample Data to try it out

The Results Website: http://result.pondiuni.edu.in/candidate.asp

Register Number: 15te1218

Degree: BTHEE

Exam: Second

Could anyone give me directions on how to accomplish this task?
Please correct me where I am wrong; it would be great if you could guide me to a solution.

Can this be done in a simpler way?

In the code below you can see that I have tried to print the name of the student, but it returns an empty list (it doesn't work). I also don't want the data returned as a list, because there is only one occurrence of that detail.

I do not know how to extract the subject names and the corresponding marks of that student from the HTML table on the results page. I need some help with this.

Code:

import requests
from bs4 import BeautifulSoup
import re
import csv

for x in xrange(44,47):

    EXAMNO ='15te12'+str(x)
    print EXAMNO

    data = {"txtregno": EXAMNO,
            "cmbdegree": r"BTHEE~\BTHEE\result.mdb", # use raw strings
            "cmbexamno": "B",
            "dpath": r"\BTHEE\result.mdb",
            "dname": "BTHEE",
            "txtexamno": "B"}

    results_page = requests.post("http://result.pondiuni.edu.in/ResultDisp.asp", data=data).content
    soup = BeautifulSoup(results_page, 'html.parser').prettify()
    regpa= "<!--Percentage / S.G.P.A : <b>(.+?) </b>&nbsp;&nbsp;&nbsp; -->"
    patterngpa =re.compile(regpa)
    gpa=re.findall(patterngpa,soup)
    print gpa
    rename="<font size=3 color=black>(.+?)</font>"
    patternname=re.compile(rename)
    name=re.findall(patternname,soup)
    print (name)


OUTPUT:

15te1244
[u'8.67']
15te1245
[u'8.8']
[]
15te1246
[u'7.8']
[]


It would be helpful if you could show me how to print it in the desired output format.

Thanks.

Answer

It took a lot of time, but I found a brute-force solution.

import requests
from bs4 import BeautifulSoup 
import re
import csv
for x in xrange(44,47):
    EXAMNO ='15te12'+str(x)
    data = {"txtregno": EXAMNO,
    "cmbdegree": r"BTHEE~\BTHEE\result.mdb", # use raw strings
    "cmbexamno": "B",
    "dpath": r"\BTHEE\result.mdb",
    "dname": "BTHEE",
    "txtexamno": "B"}
    results_page = requests.post("http://result.pondiuni.edu.in/ResultDisp.asp", data=data).content
    # Re-serialise the parsed page so attributes are normalised (quoted, sorted) before the regex searches
    string=str(BeautifulSoup(results_page, 'html.parser'))
    regpa= "<!--Percentage / S.G.P.A : <b>(.+?) </b>&nbsp;&nbsp;&nbsp; -->"
    print (re.search(regpa,string,re.M|re.I )).group(1) 
    regname="<b>Name of the student : <b><font color=\"black\" size=\"3\">(.*)</font></b></b>"
    print (re.search(regname,string,re.M|re.I )).group(1)
    regsub="66%\"><font color=\"black\" face=\"arial\" size=\"2\">(.*)</font></td>"
    matches=(re.findall(regsub,string,re.M|re.I ))

    for i in xrange(len(matches)):
        regsubm=">"+matches[i]+"</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"2%\"><font color=\"black\" face=\"arial\" size=\"2\">..</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"7%\"><font color=\"black\" face=\"arial\" size=\"2\">[\xc2]?[\xa0]?[\xc2]?[\xa0]?-</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"1%\"><font color=\"black\" face=\"arial\" size=\"2\">-</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"5%\"><font color=\"black\" face=\"arial\" size=\"2\">-</font></td>\n<td align=\"center\" bgcolor=\"white\" width=\"5%\"><font color=\"black\" face=\"arial\" size=\"2\">(.*)</font>"
        matchesm=re.findall(regsubm,string,re.M)
        print matches[i],'--->',matchesm[0]
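
For the desired CSV layout, a less regex-heavy route is to walk the table rows and hand them to the csv module. Below is a minimal sketch in Python 3, not tested against the live results page. It uses only the standard library's html.parser so it runs without extra packages; the same row walk could also be done with BeautifulSoup's find_all. SAMPLE_HTML is a made-up, simplified stand-in for one student's result table, and RowCollector, subject_marks and write_results are hypothetical helper names. It assumes the subject name is in the first cell of a row and the mark in the last cell, which would need checking against the real markup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for one student's result table.
# The real page has more columns, attributes and <font> wrappers.
SAMPLE_HTML = """
<table>
<tr><td><font>Subject1</font></td><td>-</td><td><font>8</font></td></tr>
<tr><td><font>Subject2</font></td><td>-</td><td><font>9</font></td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def subject_marks(html):
    """Map subject name (first cell) to mark (last cell) for each row."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return {row[0]: row[-1] for row in parser.rows if len(row) >= 2}


def write_results(students, out):
    """Write one CSV row per student, with one column per subject."""
    subjects = []
    for _, marks in students:
        for subject in marks:
            if subject not in subjects:
                subjects.append(subject)
    writer = csv.writer(out)
    writer.writerow(["Sl.no", "Name"] + subjects)
    for number, (name, marks) in enumerate(students, start=1):
        # A student without a given subject gets an empty cell.
        writer.writerow([number, name] + [marks.get(s, "") for s in subjects])


# Usage: in the real script each (name, marks) pair would come from one POST response.
students = [("Jason", subject_marks(SAMPLE_HTML))]
buf = io.StringIO()
write_results(students, buf)
print(buf.getvalue())
```

To write to disk instead of a string buffer, pass a file opened with open("results.csv", "w", newline="") as the out argument. Collecting all students first and building the header from the union of their subjects keeps the columns aligned even if students took different papers.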