user2346536 - 4 months ago
Python Question

Why is output different from shell output of same command?

I am using subprocess.run for some automated testing. Mostly to automate doing:

dummy.exe < file.txt > foo.txt
diff file.txt foo.txt

If you execute the above redirection in a shell, the two files are always identical. But whenever file.txt is too long, the Python code below does not return the correct result.

This is the Python code:

import subprocess
import sys

def main(argv):
    exe_path = r'dummy.exe'
    file_path = r'file.txt'

    with open(file_path, 'r') as test_file:
        stdin = test_file.read().strip()

    p = subprocess.run([exe_path], input=stdin, stdout=subprocess.PIPE,
                       stderr=subprocess.PIPE, universal_newlines=True)
    out = p.stdout.strip()
    err = p.stderr
    if stdin != out:
        print('failed: ' + out)

if __name__ == "__main__":
    main(sys.argv)
Here is the C++ code for dummy.exe:

#include <iostream>

int main() {
    int size, count, a, b;
    std::cin >> size;
    std::cin >> count;

    std::cout << size << " " << count << std::endl;

    for (int i = 0; i < count; ++i) {
        std::cin >> a >> b;
        std::cout << a << " " << b << std::endl;
    }
    return 0;
}

file.txt can be anything like this:

1 100000
0 417
0 842
0 919

The second integer on the first line is the number of lines following, hence here file.txt will be 100,001 lines long.
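To reproduce this at scale, a test input in that format can be generated with a short script (a sketch; the filename and the size/count values are just the ones from the question, and the data values are arbitrary):

```python
import random

count = 100000
with open('file.txt', 'w') as f:
    # Header line: "size count", then `count` data lines.
    f.write('1 {}\n'.format(count))
    for _ in range(count):
        f.write('0 {}\n'.format(random.randrange(1000)))
```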

Question: Am I misusing subprocess.run?


I'll start with a disclaimer: I don't have Python 3.5 (so I can't use the run function), and I wasn't able to reproduce your problem on Windows (Python 3.4.4) or Linux (3.1.6). That said...

Problems with subprocess.PIPE and Family

The subprocess.run docs say that it's just a front-end for the old subprocess.Popen-and-communicate() technique. The subprocess.Popen.communicate docs warn that:

The data read is buffered in memory, so do not use this method if the data size is large or unlimited.

This sure sounds like your problem. Unfortunately, the docs don't say how much data is "large", nor what will happen after "too much" data is read. Just "don't do that, then".

The docs for subprocess.call go into a little more detail (emphasis mine)...

Do not use stdout=PIPE or stderr=PIPE with this function. The child process will block if it generates enough output to a pipe to fill up the OS pipe buffer as the pipes are not being read from.

So do the docs for subprocess.Popen.wait:

This will deadlock when using stdout=PIPE or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use Popen.communicate() when using pipes to avoid that.

That sure sounds like Popen.communicate is the solution to this problem, but communicate's own docs say "do not use this method if the data size is large" --- exactly the situation where the wait docs tell you to use communicate. (Maybe it "avoid(s) that" by silently dropping data on the floor?)

Frustratingly, I don't see any way to use a subprocess.PIPE safely, unless you're sure you can read from it faster than your child process writes to it.
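The one safe pattern I know of is to keep draining the pipe while the child runs, for example by iterating over its stdout line by line. A minimal sketch (using a Python one-liner as a stand-in child, since I don't have your dummy.exe):

```python
import subprocess
import sys

# Stand-in child that writes far more than one OS pipe buffer (~64 KiB).
child = [sys.executable, '-c', 'for i in range(200000): print(i)']

lines = 0
with subprocess.Popen(child, stdout=subprocess.PIPE,
                      universal_newlines=True) as p:
    for line in p.stdout:  # the reader keeps pace with the writer,
        lines += 1         # so the pipe buffer never fills up
```

This only works if you genuinely consume the pipe as fast as the child fills it; buffer the lines somewhere slow and you're back to the deadlock.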

On that note...

Alternative: tempfile.TemporaryFile

You're holding all your data in memory... twice, in fact. That can't be efficient, especially if it's already in a file.

If you're allowed to use a temporary file, you can compare the two files very easily, one line at a time. This avoids all the subprocess.PIPE mess, and it's much faster, because it only uses a little bit of RAM at a time. (The IO from your subprocess might be faster, too, depending on how your operating system handles output redirection.)
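In fact, you can hand the subprocess real file handles and get exactly the shell's redirection; pipe buffers never enter the picture. A sketch with Popen (a Python one-liner stands in for dummy.exe, since I don't have it, and file.txt/foo.txt are the names from the question):

```python
import subprocess
import sys

# Stand-in for dummy.exe: a child that copies stdin to stdout.
echo_child = [sys.executable, '-c',
              'import sys, shutil; shutil.copyfileobj(sys.stdin, sys.stdout)']

with open('file.txt', 'w') as f:
    f.write('0 417\n' * 100000)  # far more than one OS pipe buffer

# Equivalent of `dummy.exe < file.txt > foo.txt`: the OS moves the
# bytes directly between the child and the files, so nothing blocks.
with open('file.txt', 'r') as fin, open('foo.txt', 'w') as fout:
    subprocess.Popen(echo_child, stdin=fin, stdout=fout).wait()
```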

Again, I can't test run, so here's a slightly older Popen-and-communicate solution (minus main and the rest of your setup):

import io
import subprocess
import tempfile

def are_text_files_equal(file0, file1):
    """Both files must be opened in "update" mode ('+' character), so
    they can be rewound to their beginnings.  Both files will be read
    until just past the first differing line, or to the end of the
    files if no differences were encountered.
    """
    for line0, line1 in zip(file0, file1):
        if line0 != line1:
            return False
    # Both files were identical to this point.  See if either file
    # has more data.
    next0 = next(file0, '')
    next1 = next(file1, '')
    if next0 or next1:
        return False
    return True

def compare_subprocess_output(exe_path, input_path):
    with tempfile.TemporaryFile(mode='w+t', encoding='utf8') as temp_file:
        with open(input_path, 'r+t') as input_file:
            p = subprocess.Popen(
              [exe_path],
              stdin=input_file,
              stdout=temp_file,  # No more PIPE.
              stderr=subprocess.PIPE,  # <sigh>
              universal_newlines=True,
              )
            err = p.communicate()[1]  # No need to store output.
            # Rewind both files, then compare them...  This must be
            # inside the `with` block, or the TemporaryFile will close
            # before we can use it.
            temp_file.seek(0)
            input_file.seek(0)
            if not are_text_files_equal(temp_file, input_file):
                print('Failed: ' + str(err))
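One subtlety in are_text_files_equal above: zip stops at the shorter file, so without the trailing next() probes, a file that is a strict prefix of the other would look "equal". A quick demonstration with in-memory files (io.StringIO standing in for real file objects):

```python
import io

short = io.StringIO('1 2\n')
longer = io.StringIO('1 2\n3 4\n')

# zip() alone only sees the common prefix...
prefix_equal = all(a == b for a, b in zip(short, longer))

# ...so probe both iterators for leftover lines afterwards.
leftover = next(short, '') or next(longer, '')
```

Here prefix_equal comes out True even though the files differ, and leftover is '3 4\n', which is exactly what the next() check catches.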

Unfortunately, since I can't reproduce your problem, even with a million-line input, I can't tell if this works. If nothing else, it ought to give you wrong answers faster.