doctorsherlock - 3 months ago 10
Python Question

Python regex for removing scraping results according to substrings?

I have written a scraper in Python. I have a group of strings that I want to search for on the page, and from those results I want to remove any that contain words from another group of strings.

Here is the code -

def find_jobs(self, company, soup):
    allowed = re.compile(r"Developer|Engineer|Designer|Admin|Manager|Writer|Executive|Lead|Analyst|Editor|"
                         r"Associate|Architect|Recruiter|Specialist|Scientist|Support|Expert|SSE|Head|"
                         r"Producer|Evangelist|Ninja", re.IGNORECASE)
    not_allowed = re.compile(r"^responsibilities$|^description$|^requirements$|^experience$|^empowering$|^engineering$|^"
                             r"find$|^skills$|^recruiterbox$|^google$|^communicating$|^associated$|^internship$|^you$|^"
                             r"proficient$|^leadsquared$|^referral$|^should$|^must$|^become$|^global$|^degree$|^good$|^"
                             r"capabilities$|^leadership$|^services$|^expertise$|^architecture$|^hire$|^follow$|^jobs$|^"
                             r"procedures$|^conduct$|^perk$|^missed$|^generation$|^search$|^tools$|^worldwide$|^contact$|^"
                             r"question$|^intern$|^classes$|^trust$|^ability$|^businesses$|^join$|^industry$|^response$|^"
                             r"using$|^work$|^based$|^grow$|^provide$|^understand$|^header$|^headline$|^masthead$|^office$", re.IGNORECASE)

    profile_list = set()
    k = soup.body.findAll(text=allowed)
    for i in k:
        if len(i) < 60 and not_allowed.search(i) is None:
            profile_list.add(i.strip().upper())
    self.update_jobs(company, profile_list)


This is where I run into a problem. With the ^ and $ anchors in not_allowed, strings such as "//HEADLINE-BG" and "ABILITY TO LEAD & MENTOR A TEAM" get through, even though "headline" and "ability" are in not_allowed. They are filtered out if I remove the anchors, but then a string such as "SCALABILITY ENGINEER" is not saved because it contains "ability", which is in not_allowed. Being new to regex, I am not sure how to get this to work. Earlier I was using this -

def find_jobs(self, company, soup):
    allowed = re.compile(r"Developer|Designer|Engineer|Admin|Manager|Writer|Executive|Lead|Analyst|Editor|"
                         r"Associate|Architect|Recruiter|Specialist|Scientist|Support|Expert|SSE|Head|"
                         r"Producer|Evangelist|Ninja", re.IGNORECASE)
    not_allowed = ['responsibilities', 'description', 'requirements', 'experience', 'empowering', 'engineering',
                   'find', 'skills', 'recruiterbox', 'google', 'communicating', 'associated', 'internship',
                   'proficient', 'leadsquared', 'referral', 'should', 'must', 'become', 'global', 'degree', 'good',
                   'capabilities', 'leadership', 'services', 'expertise', 'architecture', 'hire', 'follow',
                   'procedures', 'conduct', 'perk', 'missed', 'generation', 'search', 'tools', 'worldwide', 'contact',
                   'question', 'intern', 'classes', 'trust', 'ability', 'businesses', 'join', 'industry', 'response',
                   'you', 'using', 'work', 'based', 'grow', 'provide']

    profile_list = set()
    k = soup.body.findAll(text=allowed)
    for i in k:
        # Plain substring test: rejects i if any blocked word occurs anywhere inside it.
        if len(i) < 60 and not any(x in i.lower() for x in not_allowed):
            profile_list.add(i.strip().upper())
    self.update_jobs(company, profile_list)


But this also omitted a string whenever one of the words in not_allowed appeared inside it as a substring. Can anyone please help with this?

Answer

The regex

^ability$

means "the entire string is exactly the word ability", because ^ and $ anchor the match to the start and end of the string. If you want to match it anywhere as a substring, just change it to

ability

If you want to reject the whole word "ability" but not longer words that merely contain it, such as "disability" or "scalability", then use word boundaries, something like

\bability\b
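
Applied to the question, one way to do this is to build the whole not_allowed pattern from a word list and wrap the alternation in \b...\b. The following is only a minimal sketch: the shortened word list and the helper name is_rejected are assumptions made for illustration, not part of the original code.

```python
import re

# Hypothetical, shortened word list; the full list from the question would go here.
NOT_ALLOWED_WORDS = ["headline", "ability", "engineering", "you"]

# Join the escaped words into one alternation and wrap it in \b...\b so that
# each alternative only matches as a whole word, not inside a longer word.
not_allowed = re.compile(
    r"\b(?:" + "|".join(map(re.escape, NOT_ALLOWED_WORDS)) + r")\b",
    re.IGNORECASE,
)


def is_rejected(text):
    """Return True if text contains any blocked word as a whole word."""
    return not_allowed.search(text) is not None


# The three strings mentioned in the question.
samples = ["//HEADLINE-BG", "ABILITY TO LEAD & MENTOR A TEAM", "SCALABILITY ENGINEER"]
for s in samples:
    print(s, "->", "rejected" if is_rejected(s) else "kept")
```

With this pattern the first two strings are rejected and "SCALABILITY ENGINEER" is kept. Note that \b treats the underscore as a word character, so a string like "HEADLINE_BG" would not be rejected by this sketch.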