Xploit Xploit - 1 month ago 12
Python Question

Python : Web Scraping Specific Keywords

My question shouldn't be too hard to answer. I'm quite new to Python, and I'm not sure how to scrape a website for specific keywords.

Some details: I don't want to use Beautiful Soup or similar libraries; I'm using lxml and requests. I want to ask the user for a website URL, send a request to that URL, and grab all of the HTML, which I believe I've done with html.fromstring(site.content). The problem is the next step: I want to find any link or text ending in '.swf' and print it. Does anyone know a way of doing this?

import requests
from lxml import html

def ScrapeSwf():
    # Ask the user which page to scan.
    flashSite = raw_input('Please Provide Web URL : ')
    print 'Sending Requests...'
    flashReq = requests.get(flashSite)
    print 'Scraping...'
    # Parse the raw response body into an lxml element tree.
    flashTree = html.fromstring(flashReq.content)
    print ' Now I want to search the HTML for the .swf links'
    print ' and display them using print, probably with a while condition'


Something like that. Any help is highly appreciated.

Answer

Here goes my attempt:

import requests                         # [1]
response = requests.get(flashSite)      # [2] flashSite is the URL from the question's code
myPage = response.content               # [3]
for line in myPage.splitlines():        # [4]
    if '.swf' in line:                  # [5]
        start = line.find('http')       # [6]
        end = line.find('.swf') + 4     # [7]
        print line[start:end]           # [8]

Explanation:

1: Import the requests module. I couldn't figure out a way to get what I needed out of lxml, so I stuck with plain requests.

2: Send an HTTP GET request to the site that hosts the Flash file.

3: Save the response body to a variable.

Yes, I realize you could condense lines 2 and 3; I did it this way because it makes a bit more sense to me.

4: Iterate through the page source, line by line.

5: Check whether '.swf' appears in that line.

Lines 6 through 8 demonstrate the string slicing method that @GazDavidson mentioned in his answer. I add 4 in line 7 because '.swf' is 4 characters long, so the slice ends just after the extension.
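To make the slicing concrete, here is a small worked example on a single line of markup. The sample line and its URL are made up for illustration, and the snippet uses Python 3 print syntax rather than the Python 2 syntax above.

```python
# Hypothetical sample line; the URL is invented for this example.
line = '<embed src="http://example.com/games/pong.swf" width="400">'

start = line.find('http')      # index of the first 'http' in the line
end = line.find('.swf') + 4    # index just past '.swf' (4 characters long)
swf_url = line[start:end]
print(swf_url)

# str.find returns -1 when the substring is absent, e.g. for a relative link.
relative = '<embed src="/games/pong.swf">'
print(relative.find('http'))
```

One thing to watch for: because find returns -1 when there is no 'http' on the line, a relative link will not slice cleanly with this approach, so it is worth checking start before slicing.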

This should (roughly) print a link to the SWF file for each line that contains one.
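If a line can hold more than one link, or the links are relative, a regular expression over the whole page may be easier than slicing. The following is only a sketch of that alternative, not the method from the answer above: the pattern, the find_swf_links name, and the sample markup are all assumptions, and it is written for Python 3.

```python
import re

# Match a run of characters that are not whitespace, quotes, or angle
# brackets, ending in ".swf" (case-insensitive).
SWF_PATTERN = re.compile(r"""[^\s"'<>]+\.swf""", re.IGNORECASE)

def find_swf_links(page_text):
    """Return every matching run in page_text that ends in '.swf'."""
    return SWF_PATTERN.findall(page_text)

# Made-up sample markup for illustration.
sample = (
    '<embed src="http://example.com/a.swf"><a href="/media/b.swf">b</a>\n'
    '<p>plain c.swf text</p>'
)
for link in find_swf_links(sample):
    print(link)
```

With requests, the page text could be passed in as find_swf_links(requests.get(flashSite).text). Note that the pattern stops at '.swf', so anything after the extension (a query string, for instance) is dropped.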
