Christoffer - 1 year ago
Ruby Question

Alter regex markup to not separate float numbers (like 2.0)

I was looking for a solution to a regex problem I had in Rails, and an answer to a separate question got me 90% of the way there. Basically, I would like a Ruby/Rails script that cleans up messy text by capitalizing the first letter after every ".", "!" or "?". This code by "Mark S" does most of it:

ng = Nokogiri::HTML.fragment("<p>hello, how are you? oh, that's nice! i am glad you are fine. i am too.<br />i am glad to have met you.</p>")
ng.traverse{|n| (n.content = n.content.gsub(/(.*?)([\.|\!|\?])/) { " #{$1.strip.capitalize}#{$2}" }.strip) if n.text?}

The only issue I have with this code, and it is a big issue, is that the code adds a space in between float numbers like "2.0", making a text like:

there is a cat in the hat. it has a 2.0 inch tail!
isn't that awesome?!I think so.

come out as:

There is a cat in the hat. It has a 2. 0 inch tail!
Isn't that awesome?! I think so.

where I obviously want it to be:

There is a cat in the hat. It has a 2.0 inch tail!
Isn't that awesome?! I think so.

Any suggestions on how to alter this code, for example so that a "." inside a number is ignored?

Answer Source

It seems you want to capitalize any lowercase letter at the beginning of the string or after ., !, or ?.


s.gsub(/(\A|[.?!])(\p{Ll})/) { Regexp.last_match(1).length > 0 ? "#{$1} #{$2.capitalize}" : "#{$2.capitalize}" }

See the Ruby demo

Pattern details:

  • (\A|[.?!]) - Group 1 capturing the start of string location (empty string) or a ., ?, or !
  • (\p{Ll}) - Group 2 capturing any Unicode lowercase letter

Inside the replacement block, we check whether the Group 1 value is empty. If it is (the match is at the start of the string), we return just the capitalized letter. Otherwise, we return the punctuation, a space, and the capitalized letter.
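
To make the behaviour concrete, here is a minimal sketch that wraps the gsub call in a method. The method name capitalize_after_punctuation and the sample string are assumptions made for illustration; the regex is the one shown above, and the block body is only a shorter spelling of the same check.

```ruby
# Hypothetical helper name (not from the answer); the regex is the answer's.
def capitalize_after_punctuation(s)
  s.gsub(/(\A|[.?!])(\p{Ll})/) do
    # $1 is "" when the match is at the start of the string,
    # otherwise it is the punctuation mark that preceded the letter.
    $1.empty? ? $2.capitalize : "#{$1} #{$2.capitalize}"
  end
end

sample = "it has a 2.0 inch tail!isn't that awesome?i think so."
puts capitalize_after_punctuation(sample)
# prints: It has a 2.0 inch tail! Isn't that awesome? I think so.
```

The "." in "2.0" is followed by a digit rather than a lowercase letter, so it never matches and the number stays intact. Note that the pattern only fires when the letter comes directly after the punctuation; a letter separated from it by a space (as in "hat. it") is left as it is.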

NOTE: However, there is a problem with abbreviations (as usual in these cases), like i.e., e.g., etc. Then there are words like iPhone, iCloud, eSklep, and so on.
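
One possible way to soften the abbreviation problem is sketched below. It is a variant, not the pattern from the answer: it also accepts whitespace (including a line break) between the punctuation and the letter, and it leaves a small list of abbreviations untouched. The method name capitalize_sentences and the default abbreviation list are assumptions; words like iPhone are not handled, and an abbreviation that ends a sentence will not trigger capitalization of the next word.

```ruby
# Hypothetical helper; the abbreviation list is an assumption, extend as needed.
def capitalize_sentences(text, abbreviations = %w[i.e. e.g. etc.])
  abbr = Regexp.union(abbreviations) # escapes the dots in each entry
  # Left branch: a whole abbreviation, consumed so its dots are never
  # treated as sentence ends.
  # Right branch: start of string or . ? ! (group 1), optional whitespace
  # (group 2), then a lowercase letter (group 3) that does not begin an
  # abbreviation.
  pattern = /\b#{abbr}|(\A|[.?!])(\s*)(?!#{abbr})(\p{Ll})/
  text.gsub(pattern) do |match|
    boundary, gap, letter = $1, $2, $3
    next match if letter.nil? # abbreviation branch: return it unchanged
    # Insert a single space only when punctuation touches the letter.
    gap = " " if gap.empty? && !boundary.empty?
    "#{boundary}#{gap}#{letter.upcase}"
  end
end

puts capitalize_sentences("there is a cat in the hat. it has a 2.0 inch tail!\nisn't that awesome?!i think so.")
puts capitalize_sentences("use a regex, i.e. a pattern. it works.")
```

Consuming the abbreviation as its own alternative keeps the rest of the pattern simple, at the cost of maintaining the list by hand.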