FloHe - 3 days ago 4
Bash Question

How does the shell generate input for awk?

Say I have a file1 containing:

1,2,3,4


I can use awk to process that file like this:

awk -v FS="," '{print $1}' file1


I can also invoke awk with a here string, which means it reads from stdin:

awk -v FS="," '{print $1}' <<<"9,10,11,12"


Command 1 yields the result 1 and command 2 yields 9, as expected.

Now say I have a second file2:

4,5


If I parse both files with awk sequentially:

awk -v FS="," '{print $1}' file1 file2


I get:

1
4


as expected.

But if I mix reading from stdin and reading from files, the content from stdin is ignored and only the content of the files gets processed, in order:

awk -v FS="," '{print $1}' file1 file2 <<<"9,10,11,12"
awk -v FS="," '{print $1}' file1 <<<"9,10,11,12" file2
awk -v FS="," '{print $1}' <<<"9,10,11,12" file1 file2


All three commands yield:

1
4


which means the content from stdin simply gets thrown away. Now what is the shell doing?

Interestingly if I change command 3 to:

awk -v FS="," '{print $1}' <<<"9,10,11,12",file1,file2


I simply get 9, which makes sense, as file1 and file2 are just two more fields from stdin. But why, then, is

awk -v FS="," '{print $1}' <<<"9,10,11,12" file1 file2


not expanded to

awk -v FS="," '{print $1}' <<<"9,10,11,12 file1 file2"


which would also yield the result 9?

And why does the content from stdin get ignored? The same question arises for commands 1 and 2. What is the shell doing here?

I tried out the commands on: GNU bash, version 4.2.53(1)-release

Answer

Standard input and input from files don't mix well. This behavior is not exclusive to awk; you will find it in many command-line applications. It is logical if you think of it like this:

Files need to be processed one by one. The consuming application has no control over when the input on STDIN starts and stops. Look at echo a,b,c | awk -F, '{print $1}' file1 file2. In what order should the incoming "files" be read? If you think about when FNR would need to be reset, or what FILENAME should be, it becomes clear that this is hard to get right. awk sidesteps the problem: it reads standard input only when it is given no file operands.
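The pipeline above can be tried directly. POSIX awk also accepts a lone - operand meaning "standard input", which lets you state where the piped data belongs in the sequence. A minimal sketch, assuming the question's file1 and file2 are recreated in a scratch directory (the variable names ignored and explicit are only for illustration):

```shell
# Recreate the question's files in a scratch directory (assumed setup).
cd "$(mktemp -d)"
printf '1,2,3,4\n' > file1
printf '4,5\n' > file2

# File operands are present, so awk does not read stdin: a,b,c is ignored.
ignored=$(echo a,b,c | awk -F, '{print $1}' file1 file2)
echo "$ignored"     # prints 1 and 4

# A lone "-" operand tells awk to read stdin at that position.
# FNR restarts at 1 for each input, including stdin.
explicit=$(echo a,b,c | awk -F, '{print FNR ":" $1}' file1 - file2)
echo "$explicit"    # prints 1:1, 1:a and 1:4
```

What FILENAME holds while stdin is being read may differ between awk implementations, so the sketch prints only FNR.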

One trick is to let awk (or any other program) read from a file descriptor generated by the shell: awk -F, '{print $1}' file1 <(echo 4,5,6) file2 will do what you expected in the first place.

What happens here is that the <(...) syntax (process substitution) creates a proper file descriptor (say, /proc/self/fd/11), and the reading program can treat it just like a file. It is the second argument, so it is the second file, and it is clear what FNR and FILENAME should be.
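To see this with the question's own data, here is a rough bash sketch. It assumes file1 and file2 are recreated in a scratch directory, and the variable name result is only for illustration. The path that bash substitutes for <(...) depends on the system, so the first echo is just a way to look at it.

```shell
# Recreate the question's files in a scratch directory (assumed setup).
cd "$(mktemp -d)"
printf '1,2,3,4\n' > file1
printf '4,5\n' > file2

# Show what bash puts on the command line in place of <(...).
# The exact path varies by system.
echo <(true)

# awk receives three file operands; the middle one is the substituted path.
# FNR restarts at 1 for each of them.
result=$(awk -F, '{print FNR ":" $1}' file1 <(echo 9,10,11,12) file2)
echo "$result"      # prints 1:1, 1:9 and 1:4
```

Process substitution is a bash feature (also available in some other shells) and is not part of POSIX sh.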
