George J - 5 months ago
Python Question

Python: how to find all occurrences of a matched substring?

I have a big string, an HTML page. I need to find all names of flash drives, i.e. I need to get the content between the double quotes in data-name="USB Flash-drive Leef Fuse 32Gb">. So I need the string between data-name=" and ">. Please don't mention BeautifulSoup: I need to do it without BeautifulSoup and preferably without regular expressions, although regular expressions are also accepted.

I tried to use this:

p = re.compile('(?<=")[^,]+(?=")')
result = p.match(html_str)
print(result)


but result is None, even though the same pattern worked on regex101.com.

Answer

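You can do this with the HTML parser from the standard library (in Python 2 the module is named HTMLParser; in Python 3 it is html.parser).

Python 2: https://docs.python.org/2/library/htmlparser.html

Python 3: https://docs.python.org/3/library/html.parser.html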


from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # tag = 'sometag'
        for attr in attrs:
            # attr = ('data-name', 'USB Flash-drive Leef Fuse 32Gb')
            if attr[0] == 'data-name':
                print(attr[1])

parser = MyHTMLParser()
parser.feed('<sometag data-name="USB Flash-drive Leef Fuse 32Gb">hello  world</sometag>')

Output:

USB Flash-drive Leef Fuse 32Gb

I've added some comments to the code to show what kind of data the parser passes to handle_starttag.

It should be very easy to build from here.
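As a rough sketch of one way to build on this, the parser can collect every data-name value into a list instead of printing it, which gives all occurrences in one pass. The names DataNameCollector and find_data_names, as well as the sample markup (including the second drive name), are made up for illustration.

```python
from html.parser import HTMLParser


class DataNameCollector(HTMLParser):
    """Hypothetical helper: gathers every data-name attribute value."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag;
        # value is None for attributes written without a value.
        for name, value in attrs:
            if name == 'data-name' and value is not None:
                self.names.append(value)


def find_data_names(html_str):
    """Return all data-name values in document order."""
    collector = DataNameCollector()
    collector.feed(html_str)
    collector.close()
    return collector.names


# Sample markup invented for this example.
sample = (
    '<div data-name="USB Flash-drive Leef Fuse 32Gb">a</div>'
    '<div data-name="USB Flash-drive Kingston 16Gb">b</div>'
)
print(find_data_names(sample))
```

Each call creates a fresh parser, so results from one page do not leak into the next.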

Just feed in HTML, and it will parse it fine. Refer to the docs, and keep trying.
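As for the regex attempt in the question: re.match only tries to match at the very beginning of the string, so it returns None unless the page starts with a match, whereas re.findall scans the whole string. If a regular expression is acceptable, something like the following might do. It is only a sketch: it assumes the attribute is always written with double quotes and that the value itself contains no double quote. The names DATA_NAME_RE and find_data_names_regex and the sample string are illustrative.

```python
import re

# Capture everything between data-name=" and the next double quote.
# Assumption: values are double-quoted and contain no '"' themselves.
DATA_NAME_RE = re.compile(r'data-name="([^"]*)"')


def find_data_names_regex(html_str):
    """Return every captured data-name value, scanning the whole string."""
    return DATA_NAME_RE.findall(html_str)


# Sample markup invented for this example.
sample = (
    '<a data-name="USB Flash-drive Leef Fuse 32Gb">x</a>'
    '<a data-name="Other, drive">y</a>'
)
print(find_data_names_regex(sample))
```

Unlike the pattern in the question, this one does not exclude commas, so names containing a comma are returned whole.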