Simone Carletti - 5 months ago
Ruby Question

Parsing structured text in Ruby

There are several questions on SO about parsing structured text in Ruby, but none of them apply to my case.

I'm the author of the Ruby Whois library. The library includes several parsers to parse a WHOIS response and extract the properties from the content.

So far, I have used two approaches:

  1. Regular expressions for basic parsers (e.g. whois.aero)
  2. StringScanner for advanced parsers (e.g. whois.nic.it)

Regular expressions are not efficient because if I need to extract 15 properties, I need to scan the same response at least 15 times.

StringScanner is a nice library, but creating an efficient scanner is not that simple.
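To make the single-pass idea concrete, here is a minimal sketch using StringScanner from Ruby's standard library. The sample text and the `scan_properties` helper are hypothetical and assume simple one-line "Key: value" records; real WHOIS responses vary by registry and would need more rules.

```ruby
require "strscan"

# Hypothetical sample; not taken from any real registry.
SAMPLE = <<~WHOIS
  Domain: example.it
  Status: ok
  Created: 2001-02-03 00:00:00
  Expire Date: 2025-02-03
WHOIS

# Walk the response once, collecting every "Key: value" line.
def scan_properties(text)
  scanner = StringScanner.new(text)
  properties = {}
  until scanner.eos?
    # The key stops at the first colon, so values may contain colons.
    if scanner.scan(/([^:\n]+):[ \t]*([^\n]*)\n?/)
      properties[scanner[1].strip] = scanner[2].strip
    else
      scanner.skip(/[^\n]*\n?/) # skip lines that are not key/value pairs
    end
  end
  properties
end

properties = scan_properties(SAMPLE)
puts properties["Domain"]
```

All properties come out of one pass over the response, instead of one regular expression match per property.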

I was wondering whether there are other Ruby tools you would suggest for implementing a WHOIS record parser. I was reading about Treetop, but because WHOIS records lack a specification, I believe Treetop is not the right solution.

Any suggestions?

Answer

The obvious one is Ragel. WHOIS records are pretty simple and have a limited set of key terms and such, so it should be straightforward. And Ragel parsers have proven very efficient.

Update: As promised.

Okay, so why use Ragel? Basically, anything that can be described as a finite state machine can be described in Ragel, which then generates code for a highly efficient parser. Such a parser is typically much faster than a general-purpose regular expression engine, simply because the generated code only has to run your one machine rather than interpret arbitrary patterns.
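To show what "described as a finite state machine" means here, this is a hand-written sketch in plain Ruby. It is not Ragel syntax and not Ragel output; the `parse_with_fsm` name and the two states are assumptions made for illustration, and it only handles one-line "key: value" records.

```ruby
# A hand-written state machine over the characters of a response.
# States: :key (reading a field name) and :value (reading a field value).
def parse_with_fsm(text)
  records = []
  state = :key
  key = +""
  value = +""

  text.each_char do |ch|
    case state
    when :key
      if ch == ":"
        state = :value
      elsif ch == "\n"
        key.clear # the line had no colon, so discard it
      else
        key << ch
      end
    when :value
      if ch == "\n"
        records << [key.strip, value.strip]
        key = +""
        value = +""
        state = :key
      else
        value << ch
      end
    end
  end

  # Flush a final record that is not newline-terminated.
  records << [key.strip, value.strip] if state == :value
  records
end

p parse_with_fsm("Domain: example.com\n% comment line\nStatus: active")
```

Each character is looked at exactly once and the only bookkeeping is the current state. A generator like Ragel builds this kind of machine from a declarative description, so you don't have to maintain the transitions by hand.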

Now, you could take this further, for example by using the ABNF Generator here. Your starting description could then be as simple as something like

WHOIS ::= RECORD*
RECORD ::= FIELDNAME ':' FIELDVALUE
FIELDVALUE ::= NAMESTRING | IPADDRESS | DOMAINNAME

(I make no claim that this is valid ABNF syntax; it's just a rough BNF.) The point is that you describe the parser in a more or less intuitive form, and let the generator produce the exciting code part.
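For comparison, here is roughly what that grammar amounts to when written out by hand in Ruby. This is only a sketch under loud assumptions: the `Field` struct, the `classify` and `parse_records` helpers, and the deliberately simplified IPv4 and domain patterns are all hypothetical and are not what a generator would emit.

```ruby
# RECORD ::= FIELDNAME ':' FIELDVALUE
Field = Struct.new(:name, :type, :value)

# Simplified patterns for illustration only (no range checks, IPv4 only).
IPV4   = /\A\d{1,3}(?:\.\d{1,3}){3}\z/
DOMAIN = /\A(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}\z/i

# FIELDVALUE ::= NAMESTRING | IPADDRESS | DOMAINNAME
def classify(value)
  case value
  when IPV4 then :ipaddress
  when DOMAIN then :domainname
  else :namestring
  end
end

# WHOIS ::= RECORD*
def parse_records(text)
  text.each_line.filter_map do |line|
    name, value = line.split(":", 2)
    next if value.nil? # not a RECORD: the line has no ':'
    value = value.strip
    Field.new(name.strip, classify(value), value)
  end
end

sample = "Domain: example.com\nNameserver: 192.0.2.1\nRegistrant: Example Corp\n"
parse_records(sample).each { |f| puts "#{f.name} (#{f.type}): #{f.value}" }
```

Every rule in the grammar turns into a few lines of glue code, which is exactly the part a generator would write for you.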