Defoe - 1 month ago
Ruby Question

Ruby regex eliminate new line until . or ? or capital letter

I'd like to do the following with my strings:

line1= "You have a house\nnext to the corner."


Eliminate \n when the sentence does not end at that newline (that is, when the newline does not come after a dot, a question mark, or a capital letter), so the desired output in this case is:

"You have a house next to the corner.\n"


Another example, this time with a question mark:

"You like baggy trousers,\ndon't you?"


should become:

"You like baggy trousers, don't you?\n".


I've tried:

line1.gsub!(/(?<!?|.)"\n"/, " ")


With (?<!?|.) I mean that the character immediately preceding \n must NOT be either a question mark (?) or a dot (.).

But I get the following syntax error:

SyntaxError: (eval):2: target of repeat operator is not specified: /(?<!?|.)"\n"/


And for sentences that have a capital letter in the middle, insert a \n before that capital letter, so that the sentence:

"We were winning The Home Secretary played a important role."


Should become:

"We were winning\nThe Home Secretary played a important role."

Answer

NOTE: This answer is not meant to provide a generic way to remove unnecessary newline symbols inside sentences; it only serves the OP's purpose of removing or inserting newlines at specific places in a string.

Since you need to replace matches differently in different scenarios, you should consider a 2-step approach. The first step is

.gsub(/(?<![?.])\n/, ' ')

This one replaces every newline that is not preceded by ? or . (the (?<![?.]) part is a negative lookbehind: it fails the match if the character class [?.] matches immediately before the current location in the string).
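A minimal sketch of the first step, just to show it on the sample strings from the question. The method name join_broken_lines and the extra "Yes." line are made up for illustration and are not part of the original answer. It may also help to see why the original attempt failed: in (?<!?|.) the unescaped ? is read as a repeat operator with nothing to repeat, whereas inside a character class such as [?.] both ? and . are literal. The quotes around "\n" inside the regex literal would additionally be matched as literal quote characters, so they are dropped here.

```ruby
# Hypothetical helper name; the answer itself only shows the bare gsub call.
def join_broken_lines(text)
  # Replace a newline with a space unless the character right before it
  # is "?" or "." (negative lookbehind on the character class [?.]).
  text.gsub(/(?<![?.])\n/, ' ')
end

line1 = "You have a house\nnext to the corner."
# The trailing "Yes." is an assumed extra line, added to show a newline that is kept.
line2 = "You like baggy trousers,\ndon't you?\nYes."

p join_broken_lines(line1)
p join_broken_lines(line2)
```

The newline after the comma is turned into a space, while the newline after the question mark stays where it is.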

The second step is

.sub(/(?<!^) *+(?=[A-Z])/, "\n")

or

.sub(/(?<!^) *+(?=\p{Lu})/, "\n")

It matches zero or more spaces ( *+ is possessive, so there is no backtracking into the space pattern) that are not at the beginning of a line (because of the (?<!^) negative lookbehind; replace ^ with \A to exclude only the start of the whole string) and that are followed by a capital letter ((?=\p{Lu}) is a positive lookahead that requires an uppercase letter immediately to the right of the current location). Since this is sub rather than gsub, only the first such match is replaced.
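Here is a rough sketch of the second step on the sentence from the question. The method name break_before_capital is hypothetical. One assumption worth flagging: the replacement is written with double quotes, because in Ruby "\n" is a newline while '\n' is a backslash followed by the letter n.

```ruby
# Hypothetical helper name wrapping the second step.
def break_before_capital(text)
  # Match optional spaces (possessively) that are not at the start of a line
  # and are followed by an uppercase letter; put a newline in their place.
  # sub (not gsub) touches only the first such position.
  text.sub(/(?<!^) *+(?=\p{Lu})/, "\n")
end

sentence = "We were winning The Home Secretary played a important role."
p break_before_capital(sentence)
```

Only the space before "The" is replaced; the later capitals in "Home" and "Secretary" are left alone because sub stops after the first match.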
