Dr. UK - 4 months ago
HTML Question

Python: How to extract specific text from a web page using re

I'm trying to solve the Python Challenge:
http://www.pythonchallenge.com/pc/def/ocr.html
I know I could just copy and paste the text from the page source into a txt file and work from there (I have already done that), but I want to fetch it from the web to improve my skills. I have tried:

re.findall(r"<!--(.*?)-->", html)


But it doesn't match anything.
Here is my full code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall(r"<!--(.*)-->",str(x.content))
print codes


I also tried it like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall("<!--\n(.*)\n-->",str(x.content))
print codes


Now it finds the text, but it still can't get that mess :(

Answer

Not sure what you mean by "that mess". You should include all of the details of the challenge within this post, instead of linking users to the pythonchallenge post.

Either way, if you set the regex to single-line mode (/s), the dot character (.) also matches newlines (\n). That makes the \n(.*)\n construction in your regex unnecessary, which may solve your problem.
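To illustrate what single-line mode changes, here is a minimal sketch. The sample_html string is a made-up stand-in for the page source, not the real challenge page, and the variable names are only for illustration.

```python
import re

# Hypothetical stand-in for the page source; the real page is much longer.
sample_html = "<html>\n<!--\nfirst line\nsecond line\n-->\n</html>"

# Without re.S the dot stops at newlines, so a multi-line comment never matches.
without_flag = re.findall(r"<!--(.*?)-->", sample_html)

# With re.S (an alias of re.DOTALL) the dot also matches "\n".
with_flag = re.findall(r"<!--(.*?)-->", sample_html, re.S)

print(without_flag)
print(with_flag)
```

The first call finds nothing because the comment body spans several lines, while the second returns the whole body, newlines included.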

Here's a link to a working regex example.

Here is the modified Python 2.7 code:

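#!/usr/bin/python
import requests, re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
# re.S lets "." match newlines; "*?" stops at the first closing "-->"
codes = re.findall("<!--(.*?)-->", str(x.content), re.S)
# findall returns both comments; print the second one
print codes[1]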

Note the re.S, (.*?), and codes[1] modifications.

  • re.S (also spelled re.DOTALL) is Python's flag for single-line mode (/s)
  • (.*?) makes the * quantifier non-greedy, so each match stops at the first closing -->
  • codes[1] prints the second block of content found within HTML comments (findall(...) finds two matches and returns a list containing both).
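The three points above can be sketched without a network call. The page string below is a hypothetical two-comment document that only mimics the structure described above, and the symbols inside it are placeholder text, not the real challenge data.

```python
import re

# Hypothetical page with two HTML comments (placeholder content).
page = "<!--\nhint text\n-->\n\n<!--\n%%$@_$^\n#)^)&!_\n-->\n"

# Greedy: ".*" runs to the LAST "-->", so both comments merge into one match.
greedy = re.findall(r"<!--(.*)-->", page, re.S)

# Non-greedy: ".*?" stops at the FIRST "-->", giving one match per comment.
lazy = re.findall(r"<!--(.*?)-->", page, re.S)

print(len(greedy))      # number of greedy matches
print(len(lazy))        # number of non-greedy matches
print(lazy[1].strip())  # body of the second comment only
```

With the greedy pattern there is nothing useful at index 1, which is why the non-greedy quantifier and the codes[1] lookup go together.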