Snow Sailor - 26 days ago
C++ Question

Fastest way to read millions of integers from stdin C++?

I am working on a sorting project and I've come to the point where the main bottleneck is reading in the data. It takes my program about 20 seconds to sort 100,000,000 integers read from stdin using cin with std::ios::sync_with_stdio(false); but it turns out that 10 of those seconds are spent reading in the data to sort. We do know how many integers we will be reading in (the count is at the top of the file we need to sort).

How can I make this faster? I know it's possible because a student in a previous semester was able to do counting sort in a little over 3 seconds (and that's basically purely read time).

The program is just fed the contents of a file with integers separated by newlines like
$ ./program < numstosort.txt


Thanks

Here is the relevant code:

std::ios::sync_with_stdio(false);
int max;
cin >> max;
short num;
short* a = new short[max];
int n = 0;
while ( cin >> num ) {
    a[n] = num;
    n++;
}

Answer

This will get your data into memory about as fast as possible, assuming Linux/POSIX running on commodity hardware. Note that since you apparently aren't allowed to use compiler optimizations, C++ stream IO is not going to be the fastest way to read data: as others have noted, without optimizations that code won't run anywhere near as fast as it can.

Given that the redirected file is already open as stdin/STDIN_FILENO, use low-level system call/C-style IO. That won't need to be optimized, as it will run just about as fast as possible:

#include <sys/stat.h>   // fstat(), struct stat
#include <unistd.h>     // read(), STDIN_FILENO
#include <cstdlib>      // calloc()

struct stat sb;
int rc = ::fstat( STDIN_FILENO, &sb );

// use C-style calloc() to get memory that's been
// set to zero as calloc() is often optimized to be
// faster than a new followed by a memset().
char *data = static_cast<char *>( ::calloc( 1, sb.st_size + 1 ) );
size_t totalRead = 0UL;
while ( totalRead < static_cast<size_t>( sb.st_size ) )
{
    ssize_t bytesRead = ::read( STDIN_FILENO,
        data + totalRead, sb.st_size - totalRead );
    if ( bytesRead <= 0 )
    {
        break;
    }
    totalRead += bytesRead;
}

// data is now in memory - start processing it

That code will read your data into memory as one long C-style string. And the lack of compiler optimizations won't matter one bit, as it's almost all bare-metal system calls.

Using fstat() to get the file size allows allocating all the needed memory at once - no realloc() or copying data around is necessary.
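Once the bytes are in memory they still have to be converted to integers. That step isn't spelled out above; a minimal sketch, assuming the newline-separated format from the question and reusing the data buffer from the read loop (the extra byte calloc()'d above guarantees a terminating NUL, so strtol() can't run off the end), might look like this:

// Hypothetical parsing step - not part of the answer above, just a sketch.
// Walks the buffer once with strtol(); assumes values fit in a short,
// as in the question's code.
char *p   = data;
char *end = nullptr;

long count = std::strtol( p, &end, 10 );   // first line is the element count
p = end;

short *values = new short[ count ];
for ( long i = 0; i < count; i++ )
{
    long v = std::strtol( p, &end, 10 );
    if ( end == p )        // no digits left - short/malformed input
    {
        break;
    }
    values[ i ] = static_cast<short>( v );
    p = end;
}

strtol() skips the intervening newlines itself, so there's no need to tokenize the buffer first, and no std::stringstream overhead is involved.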

You'll need to add some error checking. A more robust version of the code would also check that what fstat() describes actually is a regular file with a real size, and not a "useless use of cat" such as cat filename | YourProgram, because in that case fstat() won't return a useful file size. Examine the sb.st_mode field of the struct stat after the call to see what the stdin stream really is:

::fstat( STDIN_FILENO, &sb );
...
if ( S_ISREG( sb.st_mode ) )
{
    // regular file...
}
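
When S_ISREG() is false (stdin is a pipe, as in the cat example), sb.st_size tells you nothing and the single up-front allocation no longer applies. That case isn't covered above; a minimal sketch, assuming a geometrically growing buffer (error checking on malloc()/realloc() omitted), might be:

// Hypothetical fallback for piped input - not part of the answer above.
// Grows the buffer geometrically because st_size is meaningless for a pipe.
size_t capacity  = 1UL << 20;    // start with 1 MiB (arbitrary)
size_t totalRead = 0UL;
char  *data      = static_cast<char *>( ::malloc( capacity + 1 ) );

for ( ;; )
{
    if ( totalRead == capacity )
    {
        capacity *= 2;
        data = static_cast<char *>( ::realloc( data, capacity + 1 ) );
    }
    ssize_t bytesRead = ::read( STDIN_FILENO,
        data + totalRead, capacity - totalRead );
    if ( bytesRead <= 0 )
    {
        break;
    }
    totalRead += bytesRead;
}
data[ totalRead ] = '\0';    // keep the C-string convention used above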

(And for really high-performance systems, it can be important to ensure that the memory pages you're reading data into are actually mapped in your process address space. Performance can really stall if data arrives faster than the kernel's memory management system can create virtual-to-physical mappings for the pages data is getting dumped into.)
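
One way to do that on Linux, not shown in the answer above, is to touch the buffer once per page before the read loop so the kernel backs every virtual page with a physical page up front; a minimal sketch, assuming the calloc()'d data buffer and the sb filled in by fstat(), might be:

// Hypothetical pre-faulting step - a sketch, not part of the answer above.
// Writing one byte per page forces the kernel to map each page before
// the read() loop starts dumping data into it.
long pageSize = ::sysconf( _SC_PAGESIZE );    // from <unistd.h>
for ( off_t off = 0; off < sb.st_size; off += pageSize )
{
    data[ off ] = 0;
}

posix_madvise() with POSIX_MADV_WILLNEED, or mmap() with MAP_POPULATE, are other ways to hint the same thing to the kernel.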

To handle a large file as fast as possible, you'd want to go multithreaded, with one thread reading data and feeding one or more data processing threads so you can start processing data before you're done reading it.
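
That pipeline isn't shown in the answer; a minimal sketch, assuming one reader thread handing fixed-size chunks to a single consumer through a mutex-protected queue (the chunk size is an arbitrary choice, and the consumer still has to deal with numbers split across chunk boundaries), might look like this:

#include <thread>
#include <mutex>
#include <condition_variable>
#include <deque>
#include <vector>
#include <unistd.h>

// Hypothetical reader/consumer pipeline - a sketch, not part of the answer
// above. Place this inside your reading function.
std::deque<std::vector<char>> chunks;    // filled chunks waiting to be parsed
std::mutex              m;
std::condition_variable cv;
bool doneReading = false;

std::thread reader( [&]
{
    for ( ;; )
    {
        std::vector<char> buf( 1 << 20 );    // 1 MiB chunks (arbitrary)
        ssize_t n = ::read( STDIN_FILENO, buf.data(), buf.size() );
        if ( n <= 0 ) break;
        buf.resize( n );
        {
            std::lock_guard<std::mutex> lock( m );
            chunks.push_back( std::move( buf ) );
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock( m );
        doneReading = true;
    }
    cv.notify_one();
} );

// consumer: parse chunks as they arrive, while the reader keeps reading
for ( ;; )
{
    std::unique_lock<std::mutex> lock( m );
    cv.wait( lock, [&]{ return !chunks.empty() || doneReading; } );
    if ( chunks.empty() && doneReading ) break;
    std::vector<char> buf = std::move( chunks.front() );
    chunks.pop_front();
    lock.unlock();
    // parse the integers in buf here
}
reader.join();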
