JustASimpleGuy - 1 year ago
Python Question

How do I loop re.search to get the next set of data?

I have two sets of data that I crawled from an HTML table using regular expressions:


<div class = "info">
<div class="name"><td>random</td></div>
<div class="hp"><td>123456</td></div>
<div class="email"><td>random@mail.com</td></div>

<div class = "info">
<div class="name"><td>random123</td></div>
<div class="hp"><td>654321</td></div>
<div class="email"><td>random123@mail.com</td></div>


matchname = re.search(r'<div class="name"><td>(.*?)</td>', match3).group(1)
matchhp = re.search(r'<div class="hp"><td>(.*?)</td>', match3).group(1)
matchemail = re.search(r'<div class="email"><td>(.*?)</td>', match3).group(1)

Using these regexes I can extract the name, hp and email of the first set.




After saving this set of data into my database, I want to save the next set. How do I get the next set of data? I tried using findall and then inserting into my database, but everything ended up on one line. I need the data to be stored in the database set by set.

I'm new to Python. Please comment on any part that is unclear and I will try to edit.


Using Racialz's answer I can loop over everything with the regex, but only the last line of data is stored in the database. How do I store all of the data instead of only the last one?

for thisMatch in re.findall(r"<td>(.+?)</td>.+?<td>(.+?)</td>.+?<td>(.+?)</td>.+?<td>(.+?)</td>", match3, re.DOTALL):
    print(thisMatch[0], thisMatch[1], thisMatch[2])

sinfo = scrapyitem(name=thisMatch[0], hp=thisMatch[1], email=thisMatch[2])


Answer

You should not be parsing HTML with regex; it's just a mess. Do it with BeautifulSoup (BS4) instead. Doing it the right way:

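from bs4 import BeautifulSoup

soup = BeautifulSoup(match3, "html.parser")
names = []
allTds = soup.find_all("td")
# Each record spans three consecutive <td> cells, so step through them in threes
for i in range(0, len(allTds) - 2, 3):
    #             name            hp                  email
    names.append((allTds[i].text, allTds[i + 1].text, allTds[i + 2].text))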
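The grouping-in-threes step does not depend on the parser, so here is a minimal sketch of just that part in plain Python. The helper name `chunk_records` and the hard-coded `cells` list are my own for illustration; in practice `cells` would be the `.text` of each `<td>` in document order.

```python
def chunk_records(cells, size=3):
    """Split a flat list of cell texts into tuples of `size` fields each."""
    # Drop a trailing partial record instead of raising IndexError
    usable = len(cells) - len(cells) % size
    return [tuple(cells[i:i + size]) for i in range(0, usable, size)]


# Cell texts in document order, matching the sample HTML in the question
cells = ["random", "123456", "random@mail.com",
         "random123", "654321", "random123@mail.com"]

for name, hp, email in chunk_records(cells):
    print(name, hp, email)
```

Each tuple is one set of data, so whatever you do with a set (print it, store it) belongs inside that `for` loop.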

And for the sake of answering the question as asked, I guess I'll include a horribly ugly regex that you should never use, ESPECIALLY because it's HTML. Don't ever use regex for parsing HTML. (Please don't use this.)

for thisMatch in re.findall(r"<td>(.+?)</td>.+?<td>(.+?)</td>.+?<td>(.+?)</td>", match3, re.DOTALL):
    print(thisMatch[0], thisMatch[1], thisMatch[2])
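On the follow-up about only the last set being stored: I can't see the rest of your code, but a likely cause is that the item is built or saved after the loop has finished, when `thisMatch` only holds the final match. The sketch below shows the general shape of saving inside the loop. It is an assumption-heavy example: it uses the standard library `sqlite3` with an in-memory database and a made-up `info` table in place of your real database, and `re.finditer` with named groups in place of `findall`.

```python
import re
import sqlite3

# Sample markup from the question; match3 is the variable name used there
match3 = """
<div class = "info">
<div class="name"><td>random</td></div>
<div class="hp"><td>123456</td></div>
<div class="email"><td>random@mail.com</td></div>

<div class = "info">
<div class="name"><td>random123</td></div>
<div class="hp"><td>654321</td></div>
<div class="email"><td>random123@mail.com</td></div>
"""

# One pattern that matches a whole set (name, hp, email) at a time
pattern = re.compile(
    r'<div class="name"><td>(?P<name>.*?)</td>.*?'
    r'<div class="hp"><td>(?P<hp>.*?)</td>.*?'
    r'<div class="email"><td>(?P<email>.*?)</td>',
    re.DOTALL,
)

# Hypothetical table; stands in for whatever database the real project uses
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE info (name TEXT, hp TEXT, email TEXT)")

for m in pattern.finditer(match3):
    # The insert happens inside the loop, so each set becomes its own row
    conn.execute(
        "INSERT INTO info (name, hp, email) VALUES (?, ?, ?)",
        (m.group("name"), m.group("hp"), m.group("email")),
    )
conn.commit()

rows = conn.execute("SELECT name, hp, email FROM info ORDER BY rowid").fetchall()
print(rows)
```

The same idea applies to the BS4 version: do the save for each tuple inside the loop rather than once after it.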