Jp Houten - 20 days ago
C# Question

Best Practice; Programmatically detecting keywords from text

I am trying to extract numbers from a string (the body of an email) based on keywords.
There are a couple of difficulties here:


  • The numbers we are looking for are always 8 characters long in our system, but senders may drop the leading "0": instead of 01234567 they send us 1234567.

  • Other numbers that are known in our system, such as phone numbers, could also match as valid numbers. We have therefore decided to detect preceding keywords like "casenumber: " and other variants.

  • Last but not least, the sender could write "casenumber: 1234567", but could also write "casenumbers: 1234567, 7654321" or any variant of that (with a divider such as ; or , or . or :).
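
The first point (a dropped leading zero) could be handled after extraction with a small normalization step. This is only a sketch: the class and method names are hypothetical, and it assumes that a 7-digit, digits-only value is always a case number missing its leading "0". Anything else is returned unchanged.

```csharp
using System;
using System.Linq;

public static class CaseNumberFormat
{
    // Hypothetical helper, not part of the original code.
    // Assumption: a value of exactly 7 ASCII digits lost its leading "0".
    public static string Normalize(string raw)
    {
        string trimmed = raw.Trim();
        bool allDigits = trimmed.All(c => c >= '0' && c <= '9');

        if (trimmed.Length == 7 && allDigits)
        {
            return "0" + trimmed;
        }

        // Already 8 characters, or not a shape we recognize: leave as-is.
        return trimmed;
    }
}
```

Restricting the padding to 7-digit values, rather than padding everything to 8 characters, avoids turning unrelated short numbers into plausible-looking case numbers.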



An example text:

Hi!

Hereby I would like to confirm that I will be present at the meeting about casenumber: 1234567 and 7654321.
Can you confirm that you have received this email?

Kind regards,
Random person


What I have tried is a regex match that searches for a list of keywords, including "casenumber:", followed by all possible dividers and then the number. This only works for one case number, though: the second, third, and so on are not found.

Code language used: C#

Current code:

Regex.Matches(checkString, keyword + @"[ +;:,.\r\n\t]*[BL0123456789][0-9]+", RegexOptions.IgnoreCase )


This is my current regex. It is used with Regex.Matches, so it finds every match in the text. It does match when the text contains "casenumber: 12345678 and casenumber: 87654321", but not when the numbers are comma-separated.
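
To make the behaviour concrete, here is a small sketch that wraps the pattern above in a hypothetical CountMatches helper (the class and method names are assumptions for illustration). Because the pattern requires the keyword directly before each number, a number that follows only a divider is never matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CurrentPatternDemo
{
    // Hypothetical wrapper around the pattern from the question.
    public static int CountMatches(string checkString, string keyword)
    {
        // Each match must start with the keyword, so only numbers that
        // directly follow a keyword can be found.
        string pattern = keyword + @"[ +;:,.\r\n\t]*[BL0123456789][0-9]+";
        return Regex.Matches(checkString, pattern, RegexOptions.IgnoreCase).Count;
    }
}
```

With the keyword "casenumber:", the text "casenumber: 12345678 and casenumber: 87654321" yields two matches, while "casenumber: 1234567, 7654321" yields one, and that single match stops after the first number.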

Answer

I've tested my variation of your original RegEx and have adapted it to work with the dividers, even the Oxford comma:

Regex.Matches(checkString, keyword + @"([ +;:,.\r\n\t]*[BL0123456789][0-9]+(([ +;:,.\r\n\t]|and)+[BL0123456789][0-9]+)*)", RegexOptions.IgnoreCase);
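
The pattern above matches the whole list as one string, so the individual numbers still have to be pulled out of it. One possible way, sketched here under some assumptions, is to put each number in a named group and read that group's Captures collection, which in .NET keeps every value the group matched. The class name, method name, and the "num" group name are hypothetical, and the Regex.Escape call on the keyword is an extra safety measure that is not in the pattern above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaseNumberExtractor
{
    // Divider and number shapes copied from the pattern in this answer.
    private const string Divider = @"[ +;:,.\r\n\t]";
    private const string Number = @"[BL0123456789][0-9]+";

    // Hypothetical helper: returns every number listed after the keyword.
    public static List<string> Extract(string checkString, string keyword)
    {
        // Both occurrences of (?<num>...) feed the same named group.
        string pattern =
            Regex.Escape(keyword)
            + Divider + "*(?<num>" + Number + ")"
            + "(?:(?:" + Divider + "|and)+(?<num>" + Number + "))*";

        var results = new List<string>();
        foreach (Match match in Regex.Matches(checkString, pattern, RegexOptions.IgnoreCase))
        {
            // Captures holds one entry per number in the matched list.
            foreach (Capture capture in match.Groups["num"].Captures)
            {
                results.Add(capture.Value);
            }
        }
        return results;
    }
}
```

Called with the keyword "casenumber:" on the example email from the question, this sketch should return 1234567 and 7654321 as separate entries. Note that the number part has no length limit, so a long digit run such as a phone number listed right after a case number would also be captured.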