J. Gogoi - 6 months ago
Linux Question

Web Scraping ~ Python

I'm new to Python and would like some help with web scraping.

I have a Raspberry Pi 3 with Python installed, and I want to extract some data from a web page using BeautifulSoup and write it to a text file with a timestamp. The Pi is on 24x7, so I want the script to repeat at a fixed time interval so that I can later plot a graph of those values.

To start, I tried:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://172.30.83.14/bsnlfup/usage.php")
bsObj = BeautifulSoup(html.read(), "html.parser")  # name the parser explicitly
print(bsObj.td)  # prints only the first <td> in the document


But the output was not what I wanted:

<td align="right">
<a href="usage.php"><img alt="" border="0" height="152" src="images/fuph.jpg" width="100%"/></a>




The data I want is enclosed in a td tag, but there are many td tags in the page, so this didn't work. I also don't know how to write the data to a text file.

The HTML source:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="Expires" content="0">
<meta http-equiv="Pragma" content="No-cache">
<meta http-equiv="Cache-Control" content="no-cache">
<meta name="keywords" content="High-Speed, Broadband, IPTV, Internet, VoIP">
<meta name="description" content="Leading provider of high-speed communication services.">
<link rel="stylesheet" type="text/css" href="css/npm.css">
<title>BSNL BROADBAND</title>
<script language="Javascript" type="text/javascript" src="js/npmcommon.js"></script>
</head>
<body onload="TINIT();" topmargin="0" leftmargin="0" marginheight="0" marginwidth="0" bgcolor="#ffffff">
<div class="portalheader" align="left">
<table style="width: 100%;" border="0" cellspacing="0" cellpadding="0" bgcolor="white">
<tr>
<td align="right">
<a href="usage.php"><img src="images/fuph.jpg" alt="" border="0" height="152" width="100%"></a>
</td>
</tr>
<tr>
<td style="width: 100%; height: 10px; background-color: rgba(29, 117, 182, 1);"></td>
</tr>
</table>
</div>
<div class="serviceservlet">
<table style="width: 100%;" border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td style="width: 165px; vertical-align: top; background-color: rgb(f, f, f);">
<table border="0" cellpadding="0" cellspacing="0" width="165">
<tbody>
<tr>
<td colspan="3" height="48">
<br>
</td>
</tr>
</tbody>
</table>
<table border="0" cellpadding="0" cellspacing="0" width="165">
<tbody>
<tr>
<td style="width: 10px;">
<br>
</td>
</tr>
</tbody>
</table>
</td>
<td valign="top" width="100%">
<table style="width: 100%; height: 204px;" border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr></tr>
<tr>
<td colspan="2">
<font size="-1" face="Verdana, Arial, Helvetica, sans-serif">
<br>
<b>
You are logged in as
'abcdef_ghijkl@bsnl.in' at 117.000.000.000.
<br>
<br>
</b>
<br>
<br>
</font>
<!--Display the available metered time usage stats-->
<table border="0" width="100%" cellpadding="0" cellspacing="0">
<noscript>
<tr>
<td>
<a href="help.php#Java_script" target="new">
<font color="#FF0000">
<u>You must have JavaScript enabled in order to view usage stats.</u>
</font>
</a>
<br>
<br>
</td>
</tr>
</noscript>
<tr>
<td colspan="4">
<font color="#0A63BF">
<b> </b>
</font>
</td>
</tr>
<tr>
<td>
<i></i>
</td>
</tr>
</table>
<br>
<table border="0" width="100%" cellpadding="0" cellspacing="0">
<noscript>
<tr>
<td>
<a href="help.php#Java_script" target="new">
<font color="#FF0000">
<u>You must have JavaScript enabled in order to view usage stats.</u>
</font>
</a>
<br>
<br>
</td>
</tr>
</noscript>
<tr>
<td colspan="7">
<font color="#0A63BF">
<b> </b>
</font>
</td>
</tr>
<tr align="left">
<th>Download Remaining with High(FUP-original)Speed </th>
</tr>
<tr align="left">
<td>78.647 GB</td>
<td>
<a href="top_up.php?service=HS-I-H-50MB-90GB-10MB-B-M&amp;timeMetered=false"><img name="addBytes" src="images/btn1.png" border="0" alt="[AddBytes]" title="Top up volume quota"></a>
</td>
</tr>
<tr height="10">
<td>
<font color="#0A63BF"></font>
</td>
</tr>
</table>
<p>
<p></p>
</p>
</td>
<td style="width: 10px; background-color: rgb(f,f,f);">
<br>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</div>
<div class="portalfooter" align="left">
<td style="vertical-align: top;">
<table style="width: 100%; height: 86px;" border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td colspan="3" rowspan="1" style="background-color: rgb(f, f, f);">
<br>
</td>
</tr>
<tr valign="top">
<td style="width: 165px; height: 10px;" border="0">
<br>
</td>
<td class="npm10Text" height="10">
<br>
<br>
<p align="right">2014 BSNL . All rights reserved.</p>
<br>
<br>
</td>
<td align="right" style="vertical-align: middle;"></td>
</tr>
<tr>
<td colspan="3" rowspan="1" style="background-color: rgba(29, 117, 182, 1);">
<br>
</td>
</tr>
</tbody>
</table>
</td>
</div>
</body>
</html>


I want to export the data in the td tag just after "Download Remaining with High(FUP-original)Speed".

That is, I want to write the 78.647 GB value to a text file with a timestamp, then repeat after a time interval and append each new reading to the same text file.

Answer

OK, you got the soup part right, but bsObj.td looks only at the first td element in the page.

You need to find the exact one that you want. In this case, it's the first td inside the second tr with align="left" (there are several ways to get there; this was the one I found easiest).

You can use:

tr = bsObj.find_all('tr', align='left')[1]      # find_all returns all the elements in a list
td = tr.find('td')      # find returns only the first element in that block
text = td.get_text()    # we want the text, not the whole tag
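
As a possible alternative to indexing the second left-aligned row, here is a minimal sketch that locates the cell by the header text instead, which may hold up better if rows are added or removed. SAMPLE_HTML is a trimmed copy of the rows from the question, and extract_remaining is a hypothetical helper name, not part of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the relevant rows from the page source in the question.
SAMPLE_HTML = """
<table>
<tr><td colspan="7"><font color="#0A63BF"><b> </b></font></td></tr>
<tr align="left">
<th>Download Remaining with High(FUP-original)Speed </th>
</tr>
<tr align="left">
<td>78.647 GB</td>
<td><a href="top_up.php"><img alt="[AddBytes]"></a></td>
</tr>
</table>
"""


def extract_remaining(html):
    """Return the text of the cell under the 'Download Remaining' header.

    Hypothetical helper: returns None if the header or value is not found.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Locate the header cell by its text rather than by its position.
    header = soup.find("th", string=re.compile("Download Remaining"))
    if header is None:
        return None
    # The value sits in the first <td> of the row that follows the header row.
    value_row = header.find_parent("tr").find_next_sibling("tr")
    if value_row is None:
        return None
    cell = value_row.find("td")
    return cell.get_text(strip=True) if cell is not None else None


print(extract_remaining(SAMPLE_HTML))  # 78.647 GB
```

On the live page you would pass html.read() from urlopen instead of SAMPLE_HTML.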


I believe you can work out the timestamp using the datetime module. Its strftime method builds a string in whatever layout you want; the full list of format directives is in the Python documentation for datetime.

One example, without the milliseconds, would be:

import datetime

now = datetime.datetime.now()
timestamp = now.strftime('%d/%m - %Hh%M')  # Note that you can add your own text with the time directives.
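
Since the readings are meant for a graph later, a format that includes the year and sorts chronologically may be easier to work with. The sketch below uses a fixed datetime so the output is predictable; make_timestamp is a hypothetical helper name and the format string is only one reasonable choice.

```python
import datetime


def make_timestamp(moment=None):
    """Format a datetime as a sortable string such as '2017-03-05 14:07:00'.

    Hypothetical helper: uses the current local time when no moment is given.
    """
    if moment is None:
        moment = datetime.datetime.now()
    return moment.strftime("%Y-%m-%d %H:%M:%S")


fixed = datetime.datetime(2017, 3, 5, 14, 7, 0)
print(make_timestamp(fixed))            # 2017-03-05 14:07:00
print(fixed.strftime("%d/%m - %Hh%M"))  # 05/03 - 14h07 (the format used above)
```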


To write the output you have several options. The simplest is to open the text file in append mode ('a') and write one line per reading, so earlier readings are kept.
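
One way to do the file part is sketched below, assuming a plain text log with one tab-separated "timestamp, value" line per reading. append_reading and the line layout are hypothetical choices for illustration, not something the page or any library prescribes.

```python
import datetime


def append_reading(path, value, moment=None):
    """Append one 'timestamp<TAB>value' line to the log file at path.

    Hypothetical helper: uses the current local time when no moment is given.
    """
    if moment is None:
        moment = datetime.datetime.now()
    stamp = moment.strftime("%Y-%m-%d %H:%M:%S")
    # Mode "a" creates the file if it is missing and always writes at the end,
    # so earlier readings are never overwritten.
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"{stamp}\t{value}\n")
```

Calling append_reading("usage_log.txt", "78.647 GB") after each scrape would add one line to usage_log.txt (the file name is only an example).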


Finally, the easiest way to repeat every X seconds is to put everything in a while loop with a sleep(X) at the end (sleep comes from the time module). Let's pretend you created a function, do_your_magic, that does all the scraping and file I/O, and that you want to run it every 60 s.

You have two options now:

1) An infinite loop that keeps going until you stop it manually with Ctrl+C or an unhandled error ends it:

from time import sleep

while True:
    do_your_magic()
    sleep(60)

2) A slightly better option is to stop once the download is over (i.e. the remaining amount is zero or the page changes), but this depends on how the page source behaves at that point. For that, your magic function could return False while downloading and True when it's done, and your loop would become:

while True:
    done = do_your_magic()
    if done:
        break
    sleep(60)

or, more directly:

while not do_your_magic():
    sleep(60)  # This runs your magic and breaks when it's done.
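
The loops above stop at the first unhandled error, which may be inconvenient on a Pi that runs unattended. The sketch below is one possible variant that reports a failed run and tries again on the next cycle. run_every, max_runs and the injectable sleep parameter are assumptions added for illustration and for testing without waiting.

```python
import time


def run_every(task, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call task() repeatedly, pausing interval_seconds after each call.

    Stops when task() returns a true value or after max_runs calls
    (None means no limit). A call that raises is reported and the loop
    carries on. Returns the number of calls made.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        runs += 1
        try:
            if task():
                break  # the task reported that it is done
        except Exception as error:  # e.g. the page is temporarily unreachable
            print("run failed:", error)
        sleep(interval_seconds)
    return runs
```

With the names from this answer, run_every(do_your_magic, 60) would behave like the last loop, except that a failed request no longer ends the script.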


Give it a try. If you need more help, just tell us how far you've got and what you think you're missing.
