Brad Solomon - 3 years ago 109
Python Question

Ignore part of match in `re.split`

Input is a two-sentence string:

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'


I'd like to `.split` `s` into sentences based on the logic that:


  • sentences end with one or more periods, exclamation marks, question marks, or a period followed by a quotation mark

  • and are then followed by one or more whitespace characters and an uppercase letter.



Desired result:

['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']


Also okay:

['Sentence 1 here', 'This sentence contains 1 fl. oz. but is one sentence.']


But I currently chop off the first character of the second sentence, because the uppercase letter is consumed as part of the match:

import re
END_SENT = re.compile(r'[.!?(.")]+[ ]+[A-Z]')
print(END_SENT.split(s))
['Sentence 1 here', 'his sentence contains 1 fl. oz. but is one sentence.']


Notice the missing T. How can I tell `.split` to ignore certain elements of the compiled pattern?

Answer Source
((?<=[.!?])|(?<=\.\")) +(?=[A-Z])


However, I would suggest the pattern below, which also splits when the quotation mark follows any of . ! ? rather than only a period:

((?<=[.!?])|(?<=[.!?]\")) +(?=[A-Z])

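When either pattern is passed to Python's `re.split`, the outer parentheses matter: `re.split` also returns the text matched by every capturing group, and here that group matches an empty string. The sketch below assumes the goal is the exact list from the question, so it rewrites the outer group as non-capturing `(?:...)`. The names `SPLIT_1`, `SPLIT_2` and `CAPTURING` are illustrative, not part of the original answer.

```python
import re

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'

# Both patterns from the answer, with the outer group made non-capturing
# so that re.split does not add the group's (empty) match to the result.
SPLIT_1 = re.compile(r'(?:(?<=[.!?])|(?<=\.")) +(?=[A-Z])')
SPLIT_2 = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]")) +(?=[A-Z])')

# The first pattern with its capturing group kept, for comparison.
CAPTURING = re.compile(r'((?<=[.!?])|(?<=\.")) +(?=[A-Z])')

result_1 = SPLIT_1.split(s)
result_2 = SPLIT_2.split(s)
result_capturing = CAPTURING.split(s)

print(result_1)
# ['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']
print(result_capturing)
# ['Sentence 1 here.', '', 'This sentence contains 1 fl. oz. but is one sentence.']
```

If the capturing form is kept, the empty strings can be filtered out afterwards, but the non-capturing group avoids producing them in the first place.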


Explanation

The part common to both patterns is ` +(?=[A-Z])`:

' +'    # One or more spaces (the only characters the split actually consumes)
(?=     # START positive lookahead: check that this follows, but do not consume it
[A-Z]   # Any uppercase ASCII letter
)       # END positive lookahead
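To see the lookahead in isolation, the following sketch compares a pattern that consumes the capital letter with one that only checks for it. The second sample string and the variable names are made up for illustration. It also shows why the lookahead alone is not enough: without a condition on what precedes the space, any space before a capital letter becomes a split point.

```python
import re

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'

# Consuming version: the capital letter is part of the match, so split discards it.
consuming = re.split(r'[.!?] +[A-Z]', s)

# Lookahead version: only the spaces are matched; the capital is checked, not consumed.
lookahead = re.split(r' +(?=[A-Z])', s)

print(consuming)
# ['Sentence 1 here', 'his sentence contains 1 fl. oz. but is one sentence.']
print(lookahead)
# ['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']

# Hypothetical second input: a capitalized name in mid-sentence also triggers a split.
over_split = re.split(r' +(?=[A-Z])', 'I met Brad. He left.')
print(over_split)
# ['I met', 'Brad.', 'He left.']
```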

The conditions for what comes before the space
For Solution 1:

(     # GROUP START
(?<=  # START positive lookbehind: make sure this comes before, but do not consume it
[.!?] # any one of these characters must come right before the splitting space
)     # END positive lookbehind
|     # OR; the alternation is the reason both lookbehinds are wrapped in a group
(?<=  # START positive lookbehind
\.\"  # the splitting space may instead be preceded by .", a case the previous set of . ! ? does not cover
)     # END positive lookbehind
)     # END GROUP
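As a rough check of these two lookbehind conditions, the sketch below runs Solution 1 on a made-up sentence containing quoted endings. It assumes the outer group is written as non-capturing `(?:...)` so that `re.split` returns only the sentences. The sample text and names are hypothetical.

```python
import re

# Solution 1, with the outer group made non-capturing for use with re.split.
SOLUTION_1 = re.compile(r'(?:(?<=[.!?])|(?<=\.")) +(?=[A-Z])')

text = 'He said "stop." Then he left. She asked "why?" Nobody answered.'

parts = SOLUTION_1.split(text)
print(parts)
# ['He said "stop."', 'Then he left.', 'She asked "why?" Nobody answered.']
```

The split after `."` succeeds through the second lookbehind, while `?"` is not a split point because neither lookbehind matches a quote that follows a question mark.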

For Solution 2:

(             # GROUP START
(?<=[.!?])    # Same as the previous lookbehind
|             # OR
(?<=[.!?]\")  # The only difference: the quote may follow any of . ! ? rather than only a period
)             # GROUP END
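The difference between the two solutions shows up only when a quotation mark follows `!` or `?`. The comparison below is a sketch on an invented input, again assuming non-capturing outer groups. `SOLUTION_1`, `SOLUTION_2` and `text` are illustrative names.

```python
import re

# Both solutions, with the outer group made non-capturing for use with re.split.
SOLUTION_1 = re.compile(r'(?:(?<=[.!?])|(?<=\.")) +(?=[A-Z])')
SOLUTION_2 = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]")) +(?=[A-Z])')

text = 'She shouted "Run!" Everyone ran. He asked "where?" Nobody knew.'

with_1 = SOLUTION_1.split(text)
with_2 = SOLUTION_2.split(text)

print(with_1)
# ['She shouted "Run!" Everyone ran.', 'He asked "where?" Nobody knew.']
print(with_2)
# ['She shouted "Run!"', 'Everyone ran.', 'He asked "where?"', 'Nobody knew.']
```

Each lookbehind has a fixed width (one or two characters), which Python's `re` module requires.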