ibramoh ibramoh - 21 days ago 5
Python Question

Links in Python

I have a regex that extracts the hyperlinks from a page's source.

When I run this piece of code:

import sys, re
import webpage_get

def print_links(page):
    print '[+] print_links()'
    links = re.findall(r'\<a.*href\=.*http\:.+', page)
    links.sort()
    print '[+]', str(len(links)), 'HyperLinks Found:'
    a = open(r'C:\Users\noh\Desktop\ApplicationDevelopment\Second Course work\result.txt', 'w')
    for link in links:
        a.write(link)
    a.close()

def main():
    sys.argv.append('http://socrdlvideo.napier.ac.uk/~csn11118/CSN08115/index.html')
    ## sys.argv.append('http://www.napier.ac.uk/Pages/home.aspx')

    if len(sys.argv) != 2:
        print '[-] usage: webpage_getlinks URL'
        return

    page = webpage_get.wget(sys.argv[1])
    print_links(page)

if __name__ == '__main__':
    main()


The result will be similar to this:

href="http://www.rottentomatoes.com/m/star_wars/trailer/">Star Wars Trailer</a>


What I really need is just the link itself, without the additional text on either side, for instance:

http://www.rottentomatoes.com/m/star_wars/trailer/


It would be great if you could tell me how to get rid of the additional text on both sides.

Answer

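Try this regex. The lookbehind and lookahead restrict the match to the URL between href=" and the closing quote; note that it expects the URL to end with a slash: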

(?<=href=\")http.*?\/(?=\")

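Or this one, which drops the lookbehind and so matches any http text that ends with a slash before a quote, not only href values: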

http.*?\/(?=\")

Demo: https://regex101.com/r/YXw2y5/1
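
To make the behaviour of the two patterns concrete, here is a small sketch run against a line based on the sample output in the question. The variable names and the example.com URL are made up for illustration; the backslashes before the quote and slash are omitted because Python does not need them.

```python
import re

# Line based on the question's sample output; `sample` is an illustrative name.
sample = '<a href="http://www.rottentomatoes.com/m/star_wars/trailer/">Star Wars Trailer</a>'

# (?<=href=") asserts that href=" comes right before the match without consuming it;
# (?=") asserts that a closing quote follows. Only the URL itself is returned.
with_lookbehind = re.findall(r'(?<=href=")http.*?/(?=")', sample)

# Without the lookbehind, any http... text ending in /" is matched.
without_lookbehind = re.findall(r'http.*?/(?=")', sample)

# Hypothetical link without a trailing slash: the pattern finds nothing,
# because it requires a "/" immediately before the closing quote.
no_slash = '<a href="http://example.com/index.html">Example</a>'
missed = re.findall(r'(?<=href=")http.*?/(?=")', no_slash)

print(with_lookbehind)
```

On the sample line both patterns return the bare URL, while the link without a trailing slash is skipped.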

So, in your code, change this line:

links = re.findall(r'\<a.*href\=.*http\:.+',page)

to this:

links = re.findall(r'(?<=href=\")http.*?\/(?=\")',page)
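
If the page also contains links that do not end with a slash, one possible variation (not part of the answer above, only a sketch) is to use a capturing group and stop at the closing quote. The function name `extract_links` and the example.com URL are assumptions for illustration, and the pattern only handles double-quoted href values.

```python
import re

def extract_links(page):
    """Return the sorted http/https URLs found in double-quoted href attributes."""
    # With a single capturing group, re.findall returns only the group's text,
    # so the surrounding href=" and closing quote are dropped.
    return sorted(re.findall(r'href="(http[^"]+)"', page))

# Two sample anchors: one from the question, one hypothetical without a trailing slash.
page = (
    '<a href="http://www.rottentomatoes.com/m/star_wars/trailer/">Star Wars Trailer</a>\n'
    '<a href="http://example.com/index.html">Example</a>\n'
)

links = extract_links(page)

# Joining with newlines keeps one link per line when writing to a file.
output = '\n'.join(links)
print(output)
```

Writing `output` to the result file instead of calling `a.write(link)` in a loop would also keep the links from running together on one line.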