ann_dos ann_dos - 5 months ago 11
Python Question

How do I extract strings that start with a number at the beginning of a line and end with a 5-digit number?

I have text like:

asf aSD ikugfr jddc ghddfj gjn dfxg
sdgal fghfh 16 rgjodrisgj frth fghsdf,
dfghdf dfhgdh gho h ghdof 67676

szdgfads
2 adf dojosd hsh fghs,
zfgdf dhgdzsfb dfgdz,
dzgdzfvg 47564

asdgasdg asdg
4334 ersga errr ertgerfd ertera erers qereadf erfesfdc wefadfe,
sfsdgfg-43647


I need to extract every string that begins with a number at the start of a line and ends with 5 digits. A match may span multiple lines.

2 adf dojosd hsh fghs,
zfgdf dhgdzsfb dfgdz,
dzgdzfvg 47564

4334 ersga errr ertgerfd ertera erers qereadf erfesfdc wefadfe,
sfsdgfg-43647


I tried the regex below, but it does not work. It only matches exactly two lines, not a single line or more than two lines together.

regex = ^[0-9](.*)(?<=,)*\n?(.*\D\d{5}\D)

Answer

Your ^[0-9](.*)(?<=,)*\n?(.*\D\d{5}\D) regex matches the start of a string/line, then 1 digit, then 0+ characters (any except newlines unless DOTALL mode is on). Then (?<=,)* is supposed to check 0+ times whether the preceding character is a comma, which does not make much sense, though Python does not mind it. After that, \n? matches 1 or 0 newlines, .* matches 0+ characters other than a newline, \D matches a non-digit, \d{5} matches 5 digits, and \D again matches a non-digit. Yucky.

I do not think it can work for any match that spans more than 3 lines (note that \D matches a newline), and it will never find a valid match at the end of the string, because the last \D requires a character after the final 5 digits.
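
To make the two \D remarks concrete, here is a small sketch that isolates just the tail of the pattern. The variable names and the shortened test strings are mine, chosen for illustration only:

```python
import re

# A trailing \D needs one more character after the five digits,
# so a 5-digit number at the very end of the string is rejected.
last_line = "sfsdgfg-43647"
with_trailing_nondigit = re.search(r'\D\d{5}\D', last_line)
with_end_anchor = re.search(r'\D\d{5}$', last_line)
print(with_trailing_nondigit)         # None
print(with_end_anchor.group())        # -43647

# \D means "any non-digit", and that includes the newline character.
crosses_newline = re.search(r'\d{5}\D', "47564\nnext line")
print(repr(crosses_newline.group()))  # '47564\n'
```

Anchoring with $ instead of a trailing \D avoids both problems.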

You may use

re.compile(r'^\d.*?\b\d{5}$', re.M|re.DOTALL)

You need the DOTALL modifier so that . can match a newline, and the MULTILINE modifier so that ^ and $ match at the start and end of each line. The \b requires a word boundary right before the final 5 digits, so a line that ends with more than 5 digits (or with 5 digits glued to a letter or underscore) will not end a match.
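
A quick way to see what each piece contributes is to drop it and compare. This is only a sketch; the short test strings and variable names are made up for the comparison:

```python
import re

PATTERN = r'^\d.*?\b\d{5}$'
block = "intro line\n2 adf,\ndzg 47564"

# Both flags are needed for a block that starts mid-string and spans lines.
both_flags = re.findall(PATTERN, block, re.M | re.S)
only_multiline = re.findall(PATTERN, block, re.M)  # . cannot cross the newline
only_dotall = re.findall(PATTERN, block, re.S)     # ^ only matches at string start
print(both_flags)      # ['2 adf,\ndzg 47564']
print(only_multiline)  # []
print(only_dotall)     # []

# Effect of \b: a line ending in six digits is rejected with it, accepted without it.
six_digits = "1 abc 123456"
with_boundary = re.search(r'^\d.*?\b\d{5}$', six_digits, re.M | re.S)
without_boundary = re.search(r'^\d.*?\d{5}$', six_digits, re.M | re.S)
print(with_boundary)             # None
print(without_boundary.group())  # 1 abc 123456
```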

Use it with re.findall:

import re
p = re.compile(r'^\d.*?\b\d{5}$', re.MULTILINE | re.DOTALL)
test_str = "asf aSD  ikugfr jddc ghddfj gjn dfxg \nsdgal fghfh 16 rgjodrisgj frth fghsdf,\ndfghdf dfhgdh gho h ghdof 67676\n\nszdgfads\n2 adf dojosd hsh fghs, \nzfgdf dhgdzsfb dfgdz,\ndzgdzfvg 47564\n\nasdgasdg asdg\n4334 ersga errr ertgerfd ertera erers qereadf erfesfdc wefadfe, \nsfsdgfg-43647"
print(p.findall(test_str))
# => ['2 adf dojosd hsh fghs, \nzfgdf dhgdzsfb dfgdz,\ndzgdzfvg 47564', '4334 ersga errr ertgerfd ertera erers qereadf erfesfdc wefadfe, \nsfsdgfg-43647']
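
One caveat: because . matches newlines here, a line that starts with a digit but whose own block never ends in 5 digits will make the match run on into a later block. If your blocks are always separated by an empty line, as in your sample (that is an assumption about your data, not something the pattern above checks), a possible variation is to forbid the match from crossing an empty line. The name BLOCK_PATTERN and the test string are hypothetical:

```python
import re

# Assumption: blocks are separated by an empty line ("\n\n").
# (?:(?!\n\n).)*? consumes characters lazily but refuses to step onto
# the first newline of a "\n\n" pair, so a match stays inside one block.
BLOCK_PATTERN = re.compile(r'^\d(?:(?!\n\n).)*?\b\d{5}$', re.M | re.S)
PLAIN_PATTERN = re.compile(r'^\d.*?\b\d{5}$', re.M | re.S)

text = "9 no five digits here\n\n2 adf,\ndzg 47564"

plain = PLAIN_PATTERN.findall(text)
bounded = BLOCK_PATTERN.findall(text)
print(plain)    # ['9 no five digits here\n\n2 adf,\ndzg 47564']
print(bounded)  # ['2 adf,\ndzg 47564']
```

If your blocks are not reliably separated by empty lines, stick with the simpler pattern and check the results.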