Adam Spiers - 1 year ago
Bash Question

lazy (non-buffered) processing of shell pipeline

I'm trying to figure out how to perform the laziest possible processing of a standard UNIX shell pipeline. For example, let's say I have a command which does some calculations, outputting results along the way, but the calculations get more and more expensive, so that the first few lines of output arrive quickly but subsequent lines arrive ever more slowly. If I'm only interested in the first few lines, then I want to obtain those via lazy evaluation, terminating the calculations as soon as possible, before they get too expensive.

This can be achieved with a straightforward shell pipeline, e.g.:

./expensive | head -n 2


However this does not work optimally. Let's simulate the calculations with a script whose per-line delay grows drastically (quartically, via i ** 4):

#!/bin/bash
# Note: the ** operator is a bash/ksh/zsh extension, not POSIX sh

i=1
while true; do
    echo "line $i"
    sleep $(( i ** 4 ))   # 1, 16, 81, ... seconds
    i=$(( i + 1 ))
done


Now when I pipe this script through head -n 2, I observe the following:


  • line 1 is output.

  • After sleeping one second, line 2 is output.

  • Despite head -n 2 having already received two (\n-terminated) lines and exited, expensive carries on running and now waits a further 16 seconds (2 ** 4) before completing, at which point the pipeline also completes.

Obviously this is not as lazy as desired, because ideally expensive would terminate as soon as the head process has received two lines. However, this does not happen; IIUC it actually terminates only after trying to write its third line, because at that point it tries to write to its STDOUT, which is connected through a pipe to the STDIN of the head process, which has already exited and is therefore no longer reading from the pipe. This causes expensive to receive a SIGPIPE, which causes the bash interpreter running the script to invoke its SIGPIPE handler, which by default terminates the running script (although this can be changed via the trap command).
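This default SIGPIPE behaviour is easy to observe with yes, which writes as fast as possible and so hits the broken pipe almost immediately:

```shell
# 'yes' prints "y" forever; once head exits after two lines, the next
# write by 'yes' raises SIGPIPE, killing it with status 128 + 13 = 141
yes | head -n 2
echo "${PIPESTATUS[0]}"   # bash-specific: exit status of 'yes' -> 141
```

With expensive, the same thing happens, just 16 seconds later, because the fatal third write is delayed by the sleep.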

So the question is: how can I make expensive quit immediately when head quits, not just when expensive tries to write its third line to a pipe which no longer has a reader at the other end? Since the pipeline is constructed and managed by the interactive shell process into which I typed the ./expensive | head -n 2 command, presumably that interactive shell is the place where any solution to this problem would lie, rather than in any modification of expensive or head? Is there any native trick or extra utility which can construct pipelines with the behaviour I want? Or maybe it's simply impossible to achieve this in bash or zsh, and the only way would be to write my own pipeline manager (e.g. in Ruby or Python) which spots when the reader terminates and immediately terminates the writer?
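For what it's worth, that "pipeline manager" idea can be sketched in plain shell too. The following is only an illustration (slow_writer is a stand-in for ./expensive, and none of the names here are a real utility): plumb the writer to the reader through a FIFO, and kill the writer the moment the reader returns.

```shell
#!/bin/bash
# slow_writer stands in for ./expensive (an assumption for this demo)
slow_writer() {
  i=1
  while true; do
    echo "line $i"
    sleep "$(( i ** 4 ))"
    i=$(( i + 1 ))
  done
}

fifo=$(mktemp -u) && mkfifo "$fifo"   # rendezvous point for the pair
slow_writer > "$fifo" &               # writer runs in the background
writer_pid=$!
head -n 2 < "$fifo"                   # reader returns after two lines
kill "$writer_pid" 2>/dev/null        # terminate the writer right away
rm -f "$fifo"
```

This returns after roughly one second instead of seventeen, because the writer is killed during its 16-second sleep rather than being left to run into SIGPIPE on its next write.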

Answer Source

If all you care about is foreground control, you can run expensive in a process substitution; it still blocks until it next tries to write, but head exits immediately (and your script's flow control can continue) after it has received its input:

head -n 2 < <(exec ./expensive)
# expensive still runs 16 seconds in the background, but doesn't block your program

In bash 4.4 and newer, process substitutions store their PIDs in $!, allowing process management in the same manner as other background processes:

# REQUIRES BASH 4.4 OR NEWER
exec {expensive_fd}< <(exec ./expensive); expensive_pid=$!
head -n 2 <&"$expensive_fd"
kill "$expensive_pid"

Another approach is a coprocess, which has the advantage of only requiring bash 4.0:

# magic: store stdin and stdout FDs in an array named "expensive", and PID in expensive_PID
coproc expensive { exec ./expensive; }

# read two lines from input FD...
head -n 2 <&"${expensive[0]}"

# ...and kill the process.
kill "$expensive_PID"
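As a self-contained way to try the coprocess approach (with seq standing in for ./expensive, purely for illustration): the coprocess's FDs are closed in subshells such as command substitutions, so the read builtin is a safe way to consume from them.

```shell
# bash 4+; 'gen' and 'seq' are stand-ins so this runs anywhere
coproc gen { exec seq 1 1000000; }

# read two lines via the 'read' builtin (avoid $(...) here, since
# coproc FDs are not available in subshells)
read -r first <&"${gen[0]}"
read -r second <&"${gen[0]}"

kill "$gen_PID" 2>/dev/null
echo "$first $second"   # -> 1 2
```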