Turtle Turtle - 1 year ago 36
Bash Question

Extracting patterns with awk within a bash script

I have this tab-delimited file:


chr1 10111412 apples
chr2 195121230 pears
chr2 991924122 elephants

If I want the rows that have chr2 in column 1, I run:

awk '/chr2\t/' Test.txt


chr2 195121230 pears
chr2 991924122 elephants

But if I have a couple hundred million lines from chr1 to chr25, and need to split them up into chr-specific text files, I thought of doing this:

for num in $(seq 1 25)
do
    awk '/chr$num\t/' Test.txt > chr$num.txt
done

I also tried changing the awk to sed:

sed -n '/chr$num\t/p' Test.txt

Both of course failed spectacularly. I suspect the script recognises $num\t as a single variable. How can I break this recognition pattern and get the script to work?


This can be done much more simply with awk alone, in a single pass over the file:

awk '{print >> ($1 ".txt")}' input.file

That's it.

If the file is large and the first column has many distinct values, awk may run out of file descriptors. In that case, close each file after writing to it (the append operator >> matters here, because reopening a closed file with > would truncate it):

awk '{f=$1".txt"; print >> f; close(f)}' input.file
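As for why the loop in the question fails: the shell does not expand $num inside single quotes, so awk and sed receive the literal text chr$num\t, in which $ is a regex end-of-line anchor. If you do want to keep the loop, one way around this is to pass the value in with awk's -v option. The following is a minimal sketch, not a drop-in script: it assumes a tab-separated Test.txt as in the question, rebuilds the three sample rows in a scratch directory, and uses an awk variable named chr that is purely illustrative.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the three sample rows from the question, tab-separated.
printf 'chr1\t10111412\tapples\nchr2\t195121230\tpears\nchr2\t991924122\telephants\n' > Test.txt

# -F'\t' splits fields on tabs; -v copies the shell value into the awk variable "chr".
# Comparing $1 exactly keeps chr2 from also matching chr20..chr25.
for num in $(seq 1 2); do
    awk -F'\t' -v chr="chr$num" '$1 == chr' Test.txt > "chr$num.txt"
done
```

This reads the input once per chromosome, so for a couple hundred million lines the single-pass awk command above is likely to be much faster. The loop is mainly useful when you only need a handful of specific chromosomes.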