Spencer Spencer - 28 days ago 5
C++ Question

Reasonable assumptions about digit grouping

I've been working on a C++ class to extract arbitrarily-sized numbers from a stream and would like to leverage the number punctuation locale facets. Needless to say, std::num_get isn't going to extract my arbitrary-size number class; it only extracts builtin number types. But the extractor can get formatting information from the locale's numpunct and moneypunct facets.
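As a rough sketch of what "getting formatting information from the locale" could look like, the snippet below reads the relevant fields from `std::numpunct<char>`. The `GroupingInfo` struct and `query_grouping` function are hypothetical names made up for illustration; only `std::use_facet`, `std::numpunct`, and its `decimal_point()`, `thousands_sep()`, and `grouping()` members come from the standard library.

```cpp
#include <locale>
#include <string>

// Hypothetical bundle of the punctuation rules a custom extractor would need.
struct GroupingInfo {
    char decimal_point;
    char thousands_sep;
    std::string grouping;  // one char per group size, least-significant group first
};

// Read the punctuation rules from the locale's numpunct<char> facet.
GroupingInfo query_grouping(const std::locale& loc) {
    const auto& np = std::use_facet<std::numpunct<char>>(loc);
    return GroupingInfo{np.decimal_point(), np.thousands_sep(), np.grouping()};
}
```

For the classic "C" locale the grouping string is empty, meaning no digit grouping is expected at all. `std::moneypunct` exposes a `grouping()` member with the same encoding.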

The aspect I'm having the most trouble grappling with is digit grouping. I get that not all cultures group digits in threes, and some cultures have variably-sized number groups.

I've come across a blog (http://blogs.msdn.com/b/oldnewthing/archive/2006/04/17/577483.aspx) which shows some examples. Wikipedia (http://en.wikipedia.org/wiki/Decimal_mark#Examples_of_use) also has a table of examples.

The C and C++ standards handle this through the locale mechanism, but the specification leaves semantic room for some very complicated situations. Recognizing a sequence of digits coming in with no end in sight, when the recognizer has been told to require correct digit grouping, is going to be extremely complicated.
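To make the difficulty concrete, here is a minimal sketch of a grouping check, assuming the usual `numpunct::grouping()` encoding: each char is a group size starting from the least-significant group, the last entry repeats, and a non-positive or `CHAR_MAX` entry means "no further grouping". The function name `grouping_ok` and the idea of collecting group lengths into a vector are my own assumptions, not anything the standard library provides.

```cpp
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// groups:   lengths of the digit groups as they were read, most-significant first.
// grouping: numpunct-style rule string, least-significant group first.
bool grouping_ok(const std::vector<std::size_t>& groups, const std::string& grouping) {
    if (groups.empty()) return false;
    if (grouping.empty()) return groups.size() == 1;  // no separators expected

    std::size_t rule = 0;
    for (std::size_t i = 0; i < groups.size(); ++i) {
        const std::size_t len = groups[groups.size() - 1 - i];  // walk right to left
        const char want = grouping[rule];
        const bool unlimited = want <= 0 || want == CHAR_MAX;
        const bool leading = (i + 1 == groups.size());  // most-significant group

        if (len == 0) return false;      // two separators in a row
        if (unlimited) return leading;   // the rest must be a single group
        const std::size_t size = static_cast<std::size_t>(want);
        // The leading group may be short; every other group must match exactly.
        if (leading ? len > size : len != size) return false;
        if (rule + 1 < grouping.size()) ++rule;  // last rule repeats
    }
    return true;
}
```

Because the rules are anchored at the least-significant end while the digits arrive most-significant first, a check like this can only give its verdict once the last digit has been seen. That is the part that gets awkward for unbounded input.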

So, can we cut down on the complexity by making some assumptions? These come from commonalities I've observed in the examples provided.

(Assumption 1) Only the least-significant group of digits can have a different size, and it can't be smaller than the other groups' size.

Failing assumption 1, we might fall back on:

(Assumption 2a) There are no more than a small number of different sizes. (Hopefully 2. I haven't seen any examples with more than two different sizes.)

(Assumption 2b) A less-significant digit group is always at least as long as every more-significant group.
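One way to test these assumptions against a real locale is to inspect its grouping string directly. The sketch below assumes the `numpunct::grouping()` encoding (least-significant group first, last entry repeating) and ignores the special non-positive and `CHAR_MAX` entries; both function names are hypothetical.

```cpp
#include <algorithm>
#include <functional>
#include <string>

// Assumption 1: at most two sizes, and only the least-significant group
// differs, being no smaller than the repeating size.
bool meets_assumption_1(const std::string& grouping) {
    return grouping.size() <= 1 ||
           (grouping.size() == 2 && grouping[0] >= grouping[1]);
}

// Assumption 2b: group sizes never grow toward the most-significant end,
// i.e. the grouping string is non-increasing.
bool meets_assumption_2b(const std::string& grouping) {
    return std::is_sorted(grouping.begin(), grouping.end(), std::greater<char>());
}
```

A grouping of "\3\2" (three digits, then twos) satisfies both checks, while a hypothetical "\2\3" satisfies neither.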

Answer

It bothered me that no one ever addressed this, but recently I stumbled across the Unicode Consortium's Common Locale Data Repository (CLDR).

Drilling down further, I found a summary chart (here) of the number formatting patterns in CLDR. It contains just two basic grouping patterns, numbers 1 and 3 in the list below; number 2 is one I came across later:

  1. #,##,##0.###: Indian languages, traditional use
  2. # #### : Chinese and Japanese traditional (not in CLDR; I discovered this later)
  3. #,##0.###: everyone else
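For illustration, the first pattern can be expressed in the standard locale machinery by deriving from `std::numpunct<char>` and overriding `do_grouping()`. This is only a sketch: `IndianPunct` and `format_indian` are hypothetical names, and the grouping string "\3\2" is my encoding of the `#,##,##0` pattern (three digits, then twos).

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: Indian-style grouping, a group of 3 then groups of 2.
struct IndianPunct : std::numpunct<char> {
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3\2"; }
};

// Format a builtin integer through a stream imbued with the facet above.
std::string format_indian(long value) {
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(std::locale::classic(), new IndianPunct));
    out << value;
    return out.str();
}
```

With this facet, `format_indian(1234567)` yields "12,34,567". The groups-of-four pattern could be sketched the same way with a grouping of "\4" and a space as the separator.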

So even my naïve assumption 1 from several years ago allowed far more latitude than was needed.

However, the number formatting chart appears to cover only modern languages that use base 10. For example, it does not include Hittite, Mayan, or Babylonian.

Finally, I don't believe std::num_get is adapted to non-positional notations (like Roman numerals).
