the_t_test_1 - 9 days ago
Python Question

Not including exact words and tags in regular expression to get date

I am using regular expressions with Python's re module.

I am trying to capture just the date from the following HTML:

<div class="row">
<div class="small-12 columns">
<strong>
Date:
</strong>
December 18th 2015
</div>
</div>
</div>


I have the regular expression:

(((?!Date:)(?!\n)(.+)(<\/strong\>)(\n)(.+))(\S))


But it still gets back all of:

</strong>
December 20th 2016


I want to ditch the </strong> tag and the whitespace and just get "December 20th 2016".

So I need to do something with the bit of the regular expression after (((?!Date:)(?!\n), i.e. this bit needs to change:

(.+)(<\/strong\>)(\n)(.+))(\S))


But I'm not sure what to change, because according to regexr.com I can't use a negative lookahead (?!) with the .+

Any ideas to get back just "December 20th 2016"?

Answer

A ?: at the beginning of a group makes it a non-capturing group. That is what you need in order to avoid capturing the unwanted parts.
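To illustrate the difference, here is a minimal sketch comparing capturing and non-capturing groups with re.findall. The sample text is a shortened stand-in for the HTML in the question, not the real page.

```python
import re

# Shortened stand-in for the input; not the full HTML from the question.
text = "Date:\nDecember 18th 2015"

# Capturing groups: findall returns one tuple per match, one item per group.
with_capture = re.findall(r"(Date:)(\n)(.+)", text)

# Non-capturing groups: only the last group captures, so findall
# returns just that string.
without_capture = re.findall(r"(?:Date:)(?:\n)(.+)", text)

print(with_capture)     # [('Date:', '\n', 'December 18th 2015')]
print(without_capture)  # ['December 18th 2015']
```

The non-capturing groups still have to match; they just don't show up in the result.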

However, as Daniel Roseman said, you should probably use an HTML parser instead.
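As a rough sketch of the parser approach, something like the following works with only the standard library's html.parser module. The class name DateExtractor and its attributes are made up for this example, and it assumes the date is the first non-blank text after the <strong> element that contains "Date:".

```python
from html.parser import HTMLParser


class DateExtractor(HTMLParser):
    """Hypothetical helper: grabs the text following <strong>Date:</strong>."""

    def __init__(self):
        super().__init__()
        self.in_strong = False   # currently inside a <strong> element
        self.want_next = False   # the "Date:" label has been seen
        self.date = None

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        if self.in_strong and "Date:" in data:
            self.want_next = True
        elif self.want_next and not self.in_strong and data.strip():
            # First non-blank text after the label: assumed to be the date.
            self.date = data.strip()
            self.want_next = False


html = """<div class="row">
<div class="small-12 columns">
<strong>
Date:
</strong>
December 18th 2015
</div>
</div>
</div>"""

parser = DateExtractor()
parser.feed(html)
parser.close()
print(parser.date)  # December 18th 2015
```

A third-party parser would likely make this shorter, but the idea is the same: find the label element, then take the text next to it, instead of matching tags with a regex.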

Edit:

from re import findall
s = """        <div class="row">
            <div class="small-12 columns">
                <strong>
                    Date:
                </strong>
            December 18th 2015 
            </div>
        </div>
        </div>"""
res = findall(r'(?:Date:)(?:\n)(?:.+)(?:\n)(?:\s+)(.+)', s)
print(res)

This prints ['December 18th 2015 '] (Python 3.5.2). Note the trailing space, which comes from the input string.
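If the trailing space is unwanted, one option (a sketch, not the only way) is to let the pattern absorb whitespace on both sides of the captured text. The string below is a shortened version of the input, written with explicit \n escapes so the trailing space is visible.

```python
import re

# Shortened input; note the space after "2015".
s = "<strong>\n    Date:\n</strong>\nDecember 18th 2015 \n</div>"

# \s* skips whitespace (including newlines) around the tag, the lazy (.+?)
# captures the date, and \s*$ with re.MULTILINE trims up to the end of the line.
match = re.search(r"Date:\s*</strong>\s*(.+?)\s*$", s, re.MULTILINE)

if match:
    print(repr(match.group(1)))  # 'December 18th 2015'
```

Calling .strip() on each item returned by findall would give the same result with the original pattern.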
