I have about 15,000 files I need to parse which could contain one or more strings/numbers from a list I have. I need to separate the files with matching strings.
Given a string: 3423423987, it could appear independently as "3423423987", or as "3423423987_1" or "3423423987_1a", "3423423987-1a", but it could also be "2133423423987". However, I only want to detect the sequence when it stands on its own or has a suffix of some sort, not when it is part of another number.
So 3423423987_1 is acceptable, but 13423423987 is not.
I'm having trouble with the regex; I haven't used regular expressions much, to be honest.
Simply speaking, if I simulate this with a list of possible positives and negatives, I should get 7 hits, for the given list. I would like to extract the text till the end of the word, so that I can record that later.
Here's my code:
import re

def check_text_for_string(text_to_parse, string_to_find):
    # Current attempt: the number, an optional underscore, then two non-word-ish characters
    pattern = r"%s_?[^0-9,a-z,A-Z]\W" % string_to_find
    return re.findall(pattern, text_to_parse)

if __name__ == "__main__":
    word_to_match = "3423423987"
    possible_word_list = [
        "3423423987_1 the cake is a lie",  # Match
        "3423423987sdgg call me Ishmael",  # Not a match
        "3423423987 please sir, can I have some more?",  # Match
        "3423423987 ",  # Match
        "adsgsdzgxdzg adsgsdag\t3423423987\t",  # Match
        "1233423423987",  # Not a match
        "A3423423987",  # Not a match
        "3423423987.0",  # Not a match
        "342342398743635645",  # Not a match
    ]
    print("%d words in sample list." % len(possible_word_list))
    print("Only 7 should match.")
    matches = check_text_for_string("\n".join(possible_word_list), word_to_match)
    print(matches)
It seems you just want to make sure the number is not matched as part of, say, a float number. You then need to use lookarounds: a lookbehind to disallow a digit plus dot right before the number, and a lookahead to disallow a dot plus digit right after it.
See the regex demo
To also match the "prefixes" (or, better, call them "suffixes" here), you need to add something like \S* (zero or more non-whitespace characters) or (?:[_-]\w+)? (an optional sequence of a _ or - followed by 1+ word chars) at the end of the pattern.
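As a rough sketch of the suffix idea, the snippet below appends \S* to the base lookaround pattern used in this answer, so that each hit runs to the end of the word. The helper name find_with_suffix and the sample strings are made up for illustration, and re.escape is added on the assumption that the search string should be taken literally.

```python
import re

def find_with_suffix(text, number):
    # Hypothetical helper: base lookaround pattern plus \S*, so the match
    # extends up to the next whitespace character.
    pattern = (
        r"(?<!\d\.)(?:\b|_)"
        + re.escape(number)      # assumption: treat the search string literally
        + r"(?:\b|_)(?!\.\d)\S*"
    )
    return re.findall(pattern, text)

samples = [
    "3423423987_1 the cake is a lie",
    "3423423987-1a and more",
    "3423423987 please sir",
    "1233423423987",
    "3423423987.0",
]
hits = [find_with_suffix(s, "3423423987") for s in samples]
for sample, found in zip(samples, hits):
    print(repr(sample), "->", found)
```

\S* is used here rather than (?:[_-]\w+)? because the trailing (?:\b|_) already consumes an underscore that directly follows the number. Keep in mind that \S* is greedy up to the next whitespace, so trailing punctuation such as a comma would end up in the match as well.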
Pattern details:
(?<!\d\.) - fail the match if we have a digit and a dot before the current position
(?:\b|_) - either a word boundary or a _ (we need it as _ is a word char)
3423423987 - the search string
(?:\b|_) - again, either a word boundary or a _
(?!\.\d) - fail the match if a dot + digit is right after the current position.
pattern = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)"%string_to_find
See the Python demo
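For reference, here is a minimal sketch of how that pattern could be dropped into the function from the question and run against each sample string separately. The re.escape call and the per-string loop are additions for illustration, not part of the original code.

```python
import re

def check_text_for_string(text_to_parse, string_to_find):
    # re.escape is an added precaution (assumption), not part of the original pattern
    pattern = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)" % re.escape(string_to_find)
    return re.findall(pattern, text_to_parse)

possible_word_list = [
    "3423423987_1 the cake is a lie",
    "3423423987sdgg call me Ishmael",
    "3423423987 please sir, can I have some more?",
    "3423423987 ",
    "adsgsdzgxdzg adsgsdag\t3423423987\t",
    "1233423423987",
    "A3423423987",
    "3423423987.0",
    "342342398743635645",
]

# Keep only the strings in which the number occurs as a standalone token
matching = [s for s in possible_word_list if check_text_for_string(s, "3423423987")]
print("%d of %d sample strings match." % (len(matching), len(possible_word_list)))
```

Note that for "3423423987_1" the matched text itself is "3423423987_", since the underscore is consumed by the second (?:\b|_) group; the suffix is only captured if the pattern is extended at the end.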
If there can be floats like "Text with .3423423987 float value", you will also need to add another lookbehind, (?<!\.), after the first one, so that the pattern starts with (?<!\d\.)(?<!\.) and continues as before.
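A small sketch of the difference, assuming the same search number as above; the variable names base and strict and the second test string are only illustrative:

```python
import re

NUMBER = "3423423987"

# Base pattern from the answer
base = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)" % NUMBER
# Same pattern with the extra (?<!\.) lookbehind added after the first one
strict = r"(?<!\d\.)(?<!\.)(?:\b|_)%s(?:\b|_)(?!\.\d)" % NUMBER

text = "Text with .3423423987 float value"
base_hits = re.findall(base, text)      # the leading dot is not rejected here
strict_hits = re.findall(strict, text)  # the extra lookbehind rejects it
plain_hits = re.findall(strict, "id 3423423987 found")

print(base_hits, strict_hits, plain_hits)
```

The base pattern still matches after a bare leading dot because (?<!\d\.) only fires when a digit precedes the dot; the second lookbehind closes that gap while leaving standalone numbers matchable.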