OopsUser OopsUser - 28 days ago 12
C++ Question

How to parse space-separated floats in C++ quickly?

I have a file with millions of lines, each line has 3 floats separated by spaces. It takes a lot of time to read the file, so I tried to read them using memory mapped files only to find out that the problem is not with the speed of IO but with the speed of the parsing.

My current parsing is to take the stream (called file) and do the following

float x,y,z;
file >> x >> y >> z;


Someone in Stack Overflow recommended to use Boost.Spirit but I couldn't find any simple tutorial to explain how to use it.

I'm trying to find a simple and efficient way to parse a line that looks like this:

"134.32 3545.87 3425"


I will really appreciate some help. I wanted to use strtok to split it, but I don't know how to convert strings to floats, and I'm not quite sure it's the best way.

I don't mind if the solution will be Boost or not. I don't mind if it won't be the most efficient solution ever, but I'm sure that it is possible to double the speed.

Thanks in advance.

Answer

If the conversion is the bottle neck (which is quite possible), you should start by using the different possiblities in the standard. Logically, one would expect them to be very close, but practically, they aren't always:

  • You've already determined that std::ifstream is too slow.

  • Converting your memory mapped data to an std::istringstream is almost certainly not a good solution; you'll first have to create a string, which will copy all of the data.

  • Writing your own streambuf to read directly from the memory, without copying (or using the deprecated std::istrstream) might be a solution, although if the problem really is the conversion... this still uses the same conversion routines.

  • You can always try fscanf, or scanf on your memory mapped stream. Depending on the implementation, they might be faster than the various istream implementations.

  • Probably faster than any of these is to use strtod. No need to tokenize for this: strtod skips leading white space (including '\n'), and has an out parameter where it puts the address of the first character not read. The end condition is a bit tricky, your loop should probably look a bit like:

    char* begin;    //  Set to point to the mmap'ed data...
                    //  You'll also have to arrange for a '\0'
                    //  to follow the data.  This is probably
                    //  the most difficult issue.
    char* end;
    errno = 0;
    double tmp = strtod( begin, &end );
    while ( errno == 0 && end != begin ) {
        //  do whatever with tmp...
        begin = end;
        tmp = strtod( begin, &end );
    }

If none of these are fast enough, you'll have to consider the actual data. It probably has some sort of additional constraints, which means that you can potentially write a conversion routine which is faster than the more general ones; e.g. strtod has to handle both fixed and scientific, and it has to be 100% accurate even if there are 17 significant digits. It also has to be locale specific. All of this is added complexity, which means added code to execute. But beware: writing an efficient and correct conversion routine, even for a restricted set of input, is non-trivial; you really do have to know what you are doing.

EDIT:

Just out of curiosity, I've run some tests. In addition to the afore mentioned solutions, I wrote a simple custom converter, which only handles fixed point (no scientific), with at most five digits after the decimal, and the value before the decimal must fit in an int:

double
convert( char const* source, char const** endPtr )
{
    char* end;
    int left = strtol( source, &end, 10 );
    double results = left;
    if ( *end == '.' ) {
        char* start = end + 1;
        int right = strtol( start, &end, 10 );
        static double const fracMult[] 
            = { 0.0, 0.1, 0.01, 0.001, 0.0001, 0.00001 };
        results += right * fracMult[ end - start ];
    }
    if ( endPtr != nullptr ) {
        *endPtr = end;
    }
    return results;
}

(If you actually use this, you should definitely add some error handling. This was just knocked up quickly for experimental purposes, to read the test file I'd generated, and nothing else.)

The interface is exactly that of strtod, to simplify coding.

I ran the benchmarks in two environments (on different machines, so the absolute values of any times aren't relevant). I got the following results:

Under Windows 7, compiled with VC 11 (/O2):

Testing Using fstream directly (5 iterations)...
    6.3528e+006 microseconds per iteration
Testing Using fscan directly (5 iterations)...
    685800 microseconds per iteration
Testing Using strtod (5 iterations)...
    597000 microseconds per iteration
Testing Using manual (5 iterations)...
    269600 microseconds per iteration

Under Linux 2.6.18, compiled with g++ 4.4.2 (-O2, IIRC):

Testing Using fstream directly (5 iterations)...
    784000 microseconds per iteration
Testing Using fscanf directly (5 iterations)...
    526000 microseconds per iteration
Testing Using strtod (5 iterations)...
    382000 microseconds per iteration
Testing Using strtof (5 iterations)...
    360000 microseconds per iteration
Testing Using manual (5 iterations)...
    186000 microseconds per iteration

In all cases, I'm reading 554000 lines, each with 3 randomly generated floating point in the range [0...10000).

The most striking thing is the enormous difference between fstream and fscan under Windows (and the relatively small difference between fscan and strtod). The second thing is just how much the simple custom conversion function gains, on both platforms. The necessary error handling would slow it down a little, but the difference is still significant. I expected some improvement, since it doesn't handle a lot of things the the standard conversion routines do (like scientific format, very, very small numbers, Inf and NaN, i18n, etc.), but not this much.

Comments