Learning is a mess - 4 months ago
Python Question

Efficient regex parsing of html

I have a piece of Python code that scrapes data point values from what seems to be a JavaScript graph on a webpage. The data looks like:

...html/javascript...
{'y':765000,...,'x':1248040800000,...},
{'y':1020000,...,'x':1279144800000,...},
{'y':1105000,...,'x':1312754400000,...}
...html/javascript...


where the dots are plotting data I skipped.

To scrape the useful information (the x/y coordinates of the data points), I used regex:

# first get the raw x data, e.g. "'x':1248040800000"
xData = re.findall(r"'x':\d+", htmlContent)
# then pull the number out of each match, one by one
xData = [int(re.findall(r"\d+", x)[0]) for x in xData]


Same for the y values. I don't know if this is terribly inefficient, but it does not look pretty or very smart, as I have many redundant calls to re.findall. Is there a way to do it in one pass? One pass for x and one pass for y?

Answer

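You can do it a little more simply by putting a capture group around the digits. When the pattern contains a single group, re.findall returns only the text matched by that group, so one call per coordinate is enough: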

import re

htmlContent = """
...html/javascript...
{'y':765000,...,'x':1248040800000,...},
{'y':1020000,...,'x':1279144800000,...},
{'y':1105000,...,'x':1312754400000,...}
...html/javascript...
"""
# The capture group makes findall return only the digits that follow 'x':
xData = [int(value) for value in re.findall(r"'x':(\d+)", htmlContent)]
print(xData)
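
If you want both coordinates from a single scan of the page, one possible variation is to capture the key as well as the number. This is only a sketch under assumptions: the 'label' and 'label2' fields in the sample are hypothetical stand-ins for the plotting data skipped in the question, and the names html_content, points, x_data and y_data are made up for illustration.

```python
import re

# Hypothetical sample: the 'label' fields stand in for the plotting data
# that the question elided; only the 'x' and 'y' keys come from the original.
html_content = """
{'y':765000,'label':'a','x':1248040800000,'label2':'b'},
{'y':1020000,'label':'c','x':1279144800000,'label2':'d'},
{'y':1105000,'label':'e','x':1312754400000,'label2':'f'}
"""

# One scan over the text: group 1 captures the key, group 2 the digits.
pattern = re.compile(r"'(x|y)':(\d+)")

# With two groups, findall returns (key, digits) tuples in document order.
points = {"x": [], "y": []}
for key, value in pattern.findall(html_content):
    points[key].append(int(value))

x_data = points["x"]
y_data = points["y"]
print(x_data)
print(y_data)
```

If every data point carries exactly one 'x' and one 'y', zip(x_data, y_data) then pairs the coordinates back up; should some points lack a key, the two lists would fall out of step and a per-record match would be safer.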