JJC JJC - 13 days ago 10
C++ Question

Why is reading lines from stdin much slower in C++ than Python?

I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Python code. Since my C++ is rusty and I'm not yet an expert Pythonista, please tell me if I'm doing something wrong or if I'm misunderstanding something.




(tl;dr answer: include the statement: cin.sync_with_stdio(false) or just use fgets instead.

tl;dr results: scroll all the way down to the bottom of my question and look at the table.)




C++ code:

#include <iostream>
#include <time.h>

using namespace std;

int main() {
string input_line;
long line_count = 0;
time_t start = time(NULL);
int sec;
int lps;

while (cin) {
getline(cin, input_line);
if (!cin.eof())
line_count++;
};

sec = (int) time(NULL) - start;
cerr << "Read " << line_count << " lines in " << sec << " seconds." ;
if (sec > 0) {
lps = line_count / sec;
cerr << " LPS: " << lps << endl;
} else
cerr << endl;
return 0;
}

//Compiled with:
//g++ -O3 -o readline_test_cpp foo.cpp


Python Equivalent:

#!/usr/bin/env python
import time
import sys

count = 0
start = time.time()

for line in sys.stdin:
count += 1

delta_sec = int(time.time() - start_time)
if delta_sec >= 0:
lines_per_sec = int(round(count/delta_sec))
print("Read {0} lines in {1} seconds. LPS: {2}".format(count, delta_sec,
lines_per_sec))


Here are my results:

$ cat test_lines | ./readline_test_cpp
Read 5570000 lines in 9 seconds. LPS: 618889

$cat test_lines | ./readline_test.py
Read 5570000 lines in 1 seconds. LPS: 5570000


Edit: I should note that I tried this both under OS-X (10.6.8) and Linux 2.6.32 (RHEL 6.2). The former is a macbook pro, the latter is a very beefy server, not that this is too pertinent.

Edit 2: (Removed this edit, as no longer applicable)

$ for i in {1..5}; do echo "Test run $i at `date`"; echo -n "CPP:"; cat test_lines | ./readline_test_cpp ; echo -n "Python:"; cat test_lines | ./readline_test.py ; done
Test run 1 at Mon Feb 20 21:29:28 EST 2012
CPP: Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 2 at Mon Feb 20 21:29:39 EST 2012
CPP: Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 3 at Mon Feb 20 21:29:50 EST 2012
CPP: Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 4 at Mon Feb 20 21:30:01 EST 2012
CPP: Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 5 at Mon Feb 20 21:30:11 EST 2012
CPP: Read 5570001 lines in 10 seconds. LPS: 557000
Python:Read 5570000 lines in 1 seconds. LPS: 5570000


Edit 3:

Okay, I tried J.N.'s suggestion of trying having python store the line read: but it made no difference to python's speed.

I also tried J.N.'s suggestion of using scanf into a char array instead of getline into a std::string. Bingo! This resulted in equivalent performance for both python and c++. (3,333,333 LPS with my input data, which by the way are just short lines of three fields each, usually about 20 chars wide, though sometimes more).

Code:

char input_a[512];
char input_b[32];
char input_c[512];
while(scanf("%s %s %s\n", input_a, input_b, input_c) != EOF) {
line_count++;
};


Speed:

$ cat test_lines | ./readline_test_cpp2
Read 10000000 lines in 3 seconds. LPS: 3333333
$ cat test_lines | ./readline_test2.py
Read 10000000 lines in 3 seconds. LPS: 3333333


(Yes, I ran it several times.) So, I guess I will now use scanf instead of getline. But, I'm still curious if people think this performance hit from std::string/getline is typical and reasonable.

Edit 4 (was: Final Edit / Solution):

Adding:
cin.sync_with_stdio(false);

Immediately above my original while loop above results in code that runs faster than Python.

New performance comparison (this is on my 2011 Macbook Pro), using the original code, the original with the sync disabled, and the original python, respectively, on a file with 20M lines of text. Yes, I ran it several times to eliminate disk caching confound.

$ /usr/bin/time cat test_lines_double | ./readline_test_cpp
33.30 real 0.04 user 0.74 sys
Read 20000001 lines in 33 seconds. LPS: 606060
$ /usr/bin/time cat test_lines_double | ./readline_test_cpp1b
3.79 real 0.01 user 0.50 sys
Read 20000000 lines in 4 seconds. LPS: 5000000
$ /usr/bin/time cat test_lines_double | ./readline_test.py
6.88 real 0.01 user 0.38 sys
Read 20000000 lines in 6 seconds. LPS: 3333333


Thanks to @Vaughn Cato for his answer! Any elaboration people can make or good references people can point to as to why this sync happens, what it means, when it's useful, and when it's okay to disable would be greatly appreciated by posterity. :-)

Edit 5 / Better Solution:

As suggested by Gandalf The Gray below, gets is even faster than scanf or the unsynchronized cin approach. I also learned that scanf and gets are both UNSAFE and should NOT BE USED due to potential of buffer overflow. So, I wrote this iteration using fgets, the safer alternative to gets. Here are the pertinent lines for my fellow noobs:

char input_line[MAX_LINE];
char *result;

//<snip>

while((result = fgets(input_line, MAX_LINE, stdin )) != NULL)
line_count++;
if (ferror(stdin))
perror("Error reading stdin.");


Now, here are the results using an even larger file (100M lines; ~3.4GB) on a fast server with very fast disk, comparing the python, the unsynced cin, and the fgets approaches, as well as comparing with the wc utility. [The scanf version segfaulted and I don't feel like troubleshooting it.]:

$ /usr/bin/time cat temp_big_file | readline_test.py
0.03user 2.04system 0:28.06elapsed 7%CPU (0avgtext+0avgdata 2464maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
Read 100000000 lines in 28 seconds. LPS: 3571428

$ /usr/bin/time cat temp_big_file | readline_test_unsync_cin
0.03user 1.64system 0:08.10elapsed 20%CPU (0avgtext+0avgdata 2464maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
Read 100000000 lines in 8 seconds. LPS: 12500000

$ /usr/bin/time cat temp_big_file | readline_test_fgets
0.00user 0.93system 0:07.01elapsed 13%CPU (0avgtext+0avgdata 2448maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps
Read 100000000 lines in 7 seconds. LPS: 14285714

$ /usr/bin/time cat temp_big_file | wc -l
0.01user 1.34system 0:01.83elapsed 74%CPU (0avgtext+0avgdata 2464maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
100000000


Recap (lines per second):
python: 3,571,428
cin (no sync): 12,500,000
fgets: 14,285,714
wc: 54,644,808


As you can see, fgets is better but still pretty far from wc performance; I'm pretty sure this is due to the fact that wc examines each character without any memory copying. I suspect that, at this point, other parts of the code will become the bottleneck, so I don't think optimizing to that level would even be worthwhile, even if possible (since, after all, I actually need to store the read lines in memory).

Also note that a small tradeoff with using a char * buffer and fgets vs unsynced cin to string is that the latter can read lines of any length, while the former requires limiting input to some finite number. In practice, this is probably a non-issue for reading most line-based input files, as the buffer can be set to a very large value that would not be exceeded by valid input.

This has been educational. Thanks to all for your comments and suggestions.

Edit 6:

As suggested by J.F. Sebastian in the comments below, the GNU wc utility uses plain C read() (within the safe-read.c wrapper) to read chunks (of 16k bytes) at a time and count new lines. Here's a python equivalent based on J.F.'s code (just showing the relevant snippet that replaces the python for loop:

BUFFER_SIZE = 16384
count = sum(chunk.count('\n') for chunk in iter(partial(sys.stdin.read, BUFFER_SIZE), ''))


The performance of this version is quite fast (though still a bit slower than the raw c wc utility, of course):

$ /usr/bin/time cat temp_big_file | readline_test3.py
0.01user 1.16system 0:04.74elapsed 24%CPU (0avgtext+0avgdata 2448maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps
Read 100000000 lines in 4.7275 seconds. LPS: 21152829


Again, it's a bit silly for me to compare C++ fgets/cin and the first python code on the one hand to wc -l and this last python snippet on the other, as the latter two don't actually store the read lines but merely count newlines. Still, it's interesting to explore all the different implementations and think about the performance implications. Thanks again!

Edit 7: Tiny benchmark addendum and recap

For completeness, I thought I'd update the read speed for the same file on the same box with the original (synced) C++ code. Again, this is for a 100M line file on a fast disk. Here's the complete table now:

Implementation Lines per second
python (default) 3,571,428
cin (default/naive) 819,672
cin (no sync) 12,500,000
fgets 14,285,714
wc (not fair comparison) 54,644,808

Answer

By default, cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:

std::ios_base::sync_with_stdio(false);

Normally, when an input stream is buffered, instead of reading one character at a time, the stream will be read in larger chunks. This reduces the number of system calls, which are typically relatively expensive. However, since the FILE* based stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example:

int myvalue1;
cin >> myvalue1;
int myvalue2;
scanf("%d",&myvalue2);

If more input was read by cin than it actually needed, then the second integer value wouldn't be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.

To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time as needed using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn't a big problem, but when you are reading millions of lines, the performance penalty is significant.

Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you knew what you were doing, so they provided the sync_with_stdio method.