Jstuff Jstuff - 6 months ago 13
Python Question

Optional Parenthesis in Regex

Using the following string in Python:

1 - GENERAL 1

1.1 RELATED DOCUMENTS 1

1.2 SUMMARY 1

1.3 DEFINITIONS 1

1.4 INFORMATIONAL SUBMITTALS 2

1.5 GENERAL COORDINATION PROCEDURES 2

1.6 COORDINATION DRAWINGS 3

1.7 REQUESTS FOR INFORMATION (RFIs) 4

1.8 PROJECT MEETINGS 6


I'm trying to create a regular expression to put the section, title, and page number in 3 groups. So far I have

(\d)(\.|\d|\s|-)+\s+([^a-z]+?)\s+\d


which can handle all situations except the line with (RFIs). How can I grab this too?
Note: Sometimes the strings may contain subsections in lowercase that I do not want. This is why [^a-z] is present. Additionally, RFIs may not always be text in parentheses.
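
For reference, here is a minimal sketch that reproduces the problem. It uses only the pattern above and two of the sample lines. The variable name original is just for illustration.

```python
import re

# The pattern from the question, unchanged.
original = re.compile(r"(\d)(\.|\d|\s|-)+\s+([^a-z]+?)\s+\d")

# A title made only of uppercase letters and spaces is captured in group 3.
print(original.search("1.8 PROJECT MEETINGS 6").group(3))

# The lowercase "s" in "(RFIs)" is excluded by [^a-z], so the title can
# never reach the trailing page number and search() finds no match.
print(original.search("1.7 REQUESTS FOR INFORMATION (RFIs) 4"))
```

So the parentheses themselves are not what breaks the match. The lowercase letter inside them is.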

Update:

END OF SECTION



Project No. 151219.00 012500 - 1 of 3 Substitution Procedures

Rev. 0, 07/23/15

Issued for Construction

Answer

Your string has three main parts.

The first is the section number, which consists of digits optionally followed by a decimal point and more digits.

The second is everything up to the page number. It normally starts with a word.

The third is the page number at the end, which is normally made up of digits.

Your regex contains more alternations than it needs. You can use this regex instead:

^\s*(\b\d+(?:[.]\d+)?)\W+(.*?)\s*(\b\d+\b)$
    <---------------->   <--->   <------->
        Section         Content  Page Number

If a section number can have more levels, such as 1.1.1, change the ? after the non-capturing group to * and use:

^\s*(\b\d+(?:[.]\d+)*)\W+(.*?)\s*(\b\d+\b)$
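
As a rough sketch of how this could be used from Python, the snippet below applies the pattern to each line separately. The names TOC_PATTERN, parse_toc and sample are made up for this example, and the sample text is a shortened copy of the lines in the question. It assumes each entry sits on its own line with no trailing whitespace.

```python
import re

# Pattern from the answer (the variant that allows 1.1.1-style sections).
TOC_PATTERN = re.compile(r"^\s*(\b\d+(?:[.]\d+)*)\W+(.*?)\s*(\b\d+\b)$")


def parse_toc(text):
    """Return (section, title, page) tuples for the lines that match."""
    rows = []
    for line in text.splitlines():
        match = TOC_PATTERN.match(line)
        if match:
            rows.append(match.groups())
    return rows


sample = """\
1 - GENERAL 1
1.7 REQUESTS FOR INFORMATION (RFIs) 4
1.8 PROJECT MEETINGS 6
END OF SECTION
Rev. 0, 07/23/15
"""

for section, title, page in parse_toc(sample):
    print(section, "|", title, "|", page)
```

Matching line by line keeps \s and \W from running across line breaks, which they could do if the whole text were searched at once with re.MULTILINE. Lines such as END OF SECTION or Rev. 0, 07/23/15 do not start with a digit, so they are skipped.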

Regex Breakdown

\b is a word boundary

\W is equivalent to [^\w], which for ASCII text is [^A-Za-z0-9_] (note the ^, which means: match anything except the characters in the class). In Python 3, \w is Unicode-aware for str patterns unless re.ASCII is set.

 ^ #Start of string
 \s* #Match any leading whitespace
 (
  \b #Word boundary
  \d+ #Match one or more digits
  (?:[.]\d+)* #Non-capturing group matching a . followed by digits, any
              #number of times (due to *). It matches the .1.1 in 1.1.1
 ) 
  \W+ #Match one or more non-word characters
  (.*?) #Lazily match everything up to the page number that follows
  \s* #Match whitespace, if any
  (\b\d+\b) #Match the page number at the end (due to $)
 $ #End of string
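
The same breakdown can be kept inside the Python source by compiling the pattern with re.VERBOSE, which ignores whitespace in the pattern and treats # as the start of a comment. This is only a sketch. The name TOC_LINE is an assumption for the example.

```python
import re

# Same regex as above, written with re.VERBOSE so the comments live in the pattern.
TOC_LINE = re.compile(
    r"""
    ^                     # start of string
    \s*                   # optional leading whitespace
    (\b\d+(?:[.]\d+)*)    # group 1: section number such as 1, 1.7 or 1.1.1
    \W+                   # one or more non-word characters (the separator)
    (.*?)                 # group 2: title, matched lazily
    \s*                   # optional whitespace before the page number
    (\b\d+\b)             # group 3: page number
    $                     # end of string
    """,
    re.VERBOSE,
)

match = TOC_LINE.match("1.7 REQUESTS FOR INFORMATION (RFIs) 4")
if match:
    section, title, page = match.groups()
    print(section, title, page)
```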