Vigo Schwartz - 9 months ago
Ruby Question

Regex parse using Nokogiri

Using Nokogiri, I need to parse a block given:

<div class="some_class">
12 AB / 4+ CD
2,600 Dollars
</div>

I need to get the AB, CD, and Dollars values if they exist.

ab = p.css(".some_class").text[....some regex....]
cd = p.css(".some_class").text[....some regex....]
dollars = p.css(".some_class").text[....some regex....]

Is that correct? If so, can someone help me with a regex to parse these values?


To get a better answer, you would have to clarify exactly what format the AB, CD and Dollars values take, but here is a solution based on the example given. It uses regexp grouping with () to capture the information we're interested in (see the end of this answer for more details).

text = p.css(".some_class").text

# one or more digits followed by a space followed by AB, capture the digits
ab = text.match(/(\d+) AB/).captures[0] # => "12"

# one or more digits followed by a literal +, then a space and CD; capture the digits and the +
cd = text.match(/(\d+\+) CD/).captures[0] # => "4+"

# digits or commas followed by "Dollars"
dollars = text.match(/([\d,]+) Dollars/).captures[0] # => "2,600"

Note that String#match returns nil when there is no match, so if the values might not exist you need a check, e.g.

if match = text.match(/([\d,]+) Dollars/)
  dollars = match.captures[0]
end
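
As an alternative to the explicit nil check, and closer to the text[...] form used in the question, String#[] accepts a regexp plus a capture index and returns nil by itself when nothing matches. The following is a minimal sketch, assuming the text looks like the sample in the question; the hard-coded string stands in for p.css(".some_class").text, and the "Euros" lookup is a made-up example of a value that is absent.

```ruby
# Assumed sample text, standing in for p.css(".some_class").text
text = "12 AB / 4+ CD\n2,600 Dollars"

# String#[] with a regexp and a capture index returns the captured
# substring, or nil when the pattern does not match.
ab      = text[/(\d+) AB/, 1]
cd      = text[/(\d+\+) CD/, 1]
dollars = text[/([\d,]+) Dollars/, 1]
euros   = text[/([\d,]+) Euros/, 1] # hypothetical value that is not present

puts ab.inspect      # "12"
puts cd.inspect      # "4+"
puts dollars.inspect # "2,600"
puts euros.inspect   # nil
```

Because a missing value simply comes back as nil, each variable can be tested on its own afterwards without wrapping every lookup in an if.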

Additional explanation of captures

To match the amount of AB, we need a pattern /\d+ AB/ to identify the right part of the text. However, we're only interested in the numeric part, so we surround it with parentheses so that we can extract it, e.g.

irb(main):027:0> match = text.match(/(\d+) AB/)
=> #<MatchData:0x2ca3440>           # the match method returns MatchData if there is a match, nil if not
irb(main):028:0> match.to_s         # match.to_s gives us the entire text that matched the pattern
=> "12 AB"
irb(main):029:0> match.captures     
=> ["12"]
# match.captures gives us an array of the parts of the pattern that were enclosed in ()
# in our example there is just 1 but there could be multiple
irb(main):030:0> match.captures[0]
=> "12"                             # the first capture - the bit we want
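
To illustrate the case of multiple captures, here is a rough sketch that pulls all three values out with a single pattern using named groups. It assumes the three values always appear together in exactly the layout of the sample text, which may not hold for the real page; the sample string and the group names (ab, cd, dollars) are chosen here for illustration.

```ruby
# Assumed sample text, standing in for p.css(".some_class").text
text = "12 AB / 4+ CD\n2,600 Dollars"

# One pattern with three named groups; \s+ covers the line break
# between the CD value and the Dollars value.
pattern = /(?<ab>\d+) AB \/ (?<cd>\d+\+) CD\s+(?<dollars>[\d,]+) Dollars/

if (match = text.match(pattern))
  ab       = match[:ab]       # look a capture up by name
  cd       = match[:cd]
  dollars  = match[:dollars]
  captures = match.captures   # all captures, in order of appearance
end

puts captures.inspect # ["12", "4+", "2,600"]
```

The trade-off is that a single combined pattern fails as a whole if any one value is missing, so the separate patterns shown earlier are the safer choice when values are optional.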

Take a look at the documentation for MatchData, in particular the captures method, for more details.