Matthew Swart - 29 days ago
Javascript Question

Parsing a .js page with Python

I have a webpage, http://timetable.ait.ie/js/filter.js, and I seriously need to parse it. I have been using BeautifulSoup over the past few days to parse HTML pages and I really get what I am doing there, but this .js file is killing me.

At the moment I am using the following code:

import urllib
page = urllib.urlopen("http://timetable.ait.ie/js/filter.js")
pageInfo = page.read()


and it returns a string containing the whole file, 18283 lines of code. I am trying to get the staff names; towards the bottom of the file there is an array:

staffarray[373][0] = "BRADY, DAMIEN";
staffarray[373][1] = "SCI";
staffarray[373][2] = "BRADY001608";


I need the values from [0] and [1] so that I can build a database from them and reference it later.

I have tried regex to find the staffarray entries, but I am completely frustrated trying to get this information. Is there anyone who can help me, please?

Answer

If you have problems with regex, use standard string functions and slicing instead.

First split the code into lines, then look for lines that contain staffarray[ together with [0] or [1]. Finally, slice out the text between the quotes.

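from urllib.request import urlopen  # Python 3; in Python 2 use urllib.urlopen

req = urlopen("http://timetable.ait.ie/js/filter.js")
# The file's encoding is not stated, so undecodable bytes are replaced
lines = req.read().decode('utf-8', 'replace').splitlines()

for x in lines:
    if 'staffarray[' in x:
        # The value sits between the first and the last double quote,
        # which works whether the line ends in "; or ";\r
        start = x.find('"') + 1
        end = x.rfind('"')
        if '[0] = ' in x:
            print('0', x[start:end])
        elif '[1] = ' in x:
            print('1', x[start:end])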
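
If you want to give regex another try later, here is a minimal sketch (Python 3) that also pairs each name with its department and puts them in a database. It assumes every entry looks like the three sample lines in the question, with double-quoted values and no escaped quotes inside them. The names parse_staff and store_staff and the staff table are made up for illustration, and the sample string stands in for the text you downloaded.

```python
import re
import sqlite3

# Matches lines such as: staffarray[373][0] = "BRADY, DAMIEN";
# Assumption: values are double-quoted and contain no escaped quotes.
STAFF_RE = re.compile(r'staffarray\[(\d+)\]\[(\d+)\]\s*=\s*"([^"]*)"')


def parse_staff(js_text):
    """Return {row_index: (name, department)} from the JavaScript source."""
    rows = {}
    for row, col, value in STAFF_RE.findall(js_text):
        rows.setdefault(int(row), {})[int(col)] = value
    # Keep only rows that have both the [0] (name) and [1] (department) fields
    return {r: (f[0], f[1]) for r, f in rows.items() if 0 in f and 1 in f}


def store_staff(conn, staff):
    """Write the parsed rows to a hypothetical `staff` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staff "
        "(id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO staff (id, name, department) VALUES (?, ?, ?)",
        [(r, name, dept) for r, (name, dept) in staff.items()],
    )
    conn.commit()


# Stand-in for the downloaded file contents
sample = '''staffarray[373][0] = "BRADY, DAMIEN";
staffarray[373][1] = "SCI";
staffarray[373][2] = "BRADY001608";
'''

staff = parse_staff(sample)
conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the data
store_staff(conn, staff)
print(conn.execute("SELECT name, department FROM staff").fetchall())
```

To run it on the real file, pass the decoded page text to parse_staff instead of sample. The array index is used as the primary key here only because it is the one identifier both lines share; the value in [2] might be a better key if it turns out to be unique.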