Turtle - 7 months ago
Bash Question

Extracting patterns with awk within a bash script

I have this tab delimited file.

Test.txt

chr1 10111412 apples
chr2 195121230 pears
chr2 991924122 elephants


If I want the lines whose first column is chr2, I can run:

awk '/chr2\t/' Test.txt


Output:

chr2 195121230 pears
chr2 991924122 elephants


But if I have a couple hundred million lines from chr1 to chr25, and need to split them up into chr-specific text files, I thought of doing this:

#!/bin/sh
for num in $(seq 1 25)
do
awk '/chr$num\t/' Test.txt > chr$num.txt
done


I also tried changing the awk to sed

sed -n '/chr$num\t/p' Test.txt


Both of course failed spectacularly. I suspect the script treats
'/chr$num\t/'
as one literal string instead of expanding $num. How can I get the variable expanded so that the script works?

Answer

It can be done much more simply with awk, in a single pass over the file and without a shell loop:

awk '{print >> ($1 ".txt")}' input.file

That's it.
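
A minimal sketch of that one-pass split, run against the three sample lines from the question. The scratch directory from `mktemp -d` and the explicit `-F '\t'` are assumptions added for the demo (the default field separator also works on this data), and the parentheses around the file name are there because an unparenthesized concatenation after `>>` is not parsed the same way by every awk implementation.

```shell
#!/bin/sh
# Work in a scratch directory so the generated chr*.txt files stay isolated.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the tab-delimited sample from the question.
printf 'chr1\t10111412\tapples\n'     >  Test.txt
printf 'chr2\t195121230\tpears\n'     >> Test.txt
printf 'chr2\t991924122\telephants\n' >> Test.txt

# One pass over the input: every line is appended to a file named
# after its first column (chr1.txt, chr2.txt, ...).
awk -F '\t' '{ print >> ($1 ".txt") }' Test.txt

# List the per-chromosome files that were created.
ls chr*.txt
```

Because `>>` appends, running the command twice in the same directory doubles the contents of each output file, so it may be worth removing old chr*.txt files first.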


If the first column takes a large number of distinct values, awk may run out of file descriptors, because it keeps every output file open until it exits. In that case you need to close each file after writing to it, which is slower but works for any number of values:

awk '{f=$1".txt"; print >> f; close(f)}' input.file
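
As for why the loop in the question fails: inside single quotes the shell does not expand `$num`, so awk receives the text `chr$num\t` literally and no line matches. A minimal sketch of one common fix is to pass the value in with `awk -v` and compare the first field exactly. The variable name `chrom`, the scratch directory, and the shortened range 1 to 3 are assumptions for illustration only. This approach rereads the input once per chromosome, so the single-pass command is likely still the better choice for hundreds of millions of lines.

```shell
#!/bin/sh
# Scratch directory and sample data, only for this demonstration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'chr1\t10111412\tapples\nchr2\t195121230\tpears\nchr2\t991924122\telephants\n' > Test.txt

# The question loops over 1..25; three values keep the demo short.
for num in 1 2 3
do
    # -v copies the shell value into the awk variable "chrom" (a name
    # chosen here), so the awk program can stay inside single quotes.
    # An exact comparison on $1 needs no regex escaping.
    awk -F '\t' -v chrom="chr$num" '$1 == chrom' Test.txt > "chr$num.txt"
done
```

With this sample, chr3.txt is created but stays empty, since the redirection creates the file even when awk prints nothing.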