Floran Gmehlin - 5 months ago
Bash Question

Extract multiple lines from large text file with sed while preserving each trailing newline (Bash Script)

I have a large text file of several million lines, from which I need to extract specific lines.

Since I need to extract about 300000 lines (the line numbers to extract are read from a file), I process them in batches of x lines (say 200) to speed up the process, with the following command:

sed '1000p;1002p;2003p;...(200 times)...10001q;d' large_text_file >> extracted.txt


First I construct the string 1000p;1002p;2003p;...(200 times)...10001q;d, then I call the sed command with the string as argument and repeat this until all lines are processed.

sed_lines="1000p;1002p;2003p;...(200 times)...10001q;d"
sed "$sed_lines" large_text_file >> extracted.txt


The problem I have is that these 200 lines are now treated as one single line, as sed does not seem to keep the \n at the end of each line.

Question 1: Is there an option in sed for preserving the \n at the end of each line?

Answer 1: OK, I figured this out quickly after writing this post. Basically, I had missed the double quotes around $sentences in the line:

echo $sentences >> $forig.pseudo ==> echo "$sentences" >> $forig.pseudo
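To make the effect of the quotes concrete, here is a minimal sketch. The two-line value is a hypothetical stand-in for what sed returns; sed itself keeps the newlines, and it is the unquoted expansion that collapses them.

```shell
# Hypothetical two-line sample standing in for the output of sed
sentences=$(printf 'first line\nsecond line\n')

# Unquoted: the shell splits the value on whitespace (newlines included)
# and echo joins the resulting words with single spaces.
unquoted=$(echo $sentences)

# Quoted: the value reaches echo as one argument, embedded newline intact.
quoted=$(echo "$sentences")

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

The unquoted variant ends up as a single line, while the quoted one still spans two lines.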


Question 2: Is there a faster way to do this?

Answer 2: fedorqui's answer with awk is really fast and efficient.

For the sake of comprehension, here is the bulk of the script that does this process (edited following fedorqui's comment about the while loop):

echo "Extracting lines"
sed_lines=""
cnt=0
thres_cnt=0
while IFS=$'\t' read -r linenr rest; do
    sed_lines+="$linenr"                        # Append line number
    ((cnt++))                                   # Batch counter
    if [ "$cnt" -eq 200 ]; then
        sed_lines+="q;d"                        # Last number of the batch: print that line, then quit
        sentences=$(sed "$sed_lines" "$forig")  # Extract lines from file
        ((thres_cnt+=cnt))                      # Count the line numbers handled in this batch
        echo "$thres_cnt lines processed"
        echo "$sentences" >> "$forig.pseudo"    # Quoted so the newlines survive
        sed_lines=""
        cnt=0
    else
        sed_lines+="p;"
    fi
done < "${fperp}_cut_sorted"
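As shown, the loop only calls sed when the counter reaches 200, so a final batch with fewer entries would never be extracted (the part of the script not shown here may already handle this). A possible way to flush that remainder is sketched below. The function name extract_batched, its arguments and the configurable batch size are illustrative and not part of the original script, and the line numbers are assumed to be sorted in ascending order.

```shell
# Sketch only: extract_batched and its arguments are illustrative names.
# The body runs in a subshell (parentheses) so its variables stay local.
extract_batched() (
    numbers_file=$1
    source_file=$2
    batch_size=${3:-200}
    sed_lines=""
    cnt=0
    last=""
    tab=$(printf '\t')
    while IFS=$tab read -r linenr rest; do
        sed_lines="${sed_lines}${linenr}p;"   # print this line number
        last=$linenr                          # remember where sed may stop
        cnt=$((cnt + 1))
        if [ "$cnt" -eq "$batch_size" ]; then
            # -n suppresses automatic printing; quit after the last wanted line
            sed -n "${sed_lines}${last}q" "$source_file"
            sed_lines=""
            cnt=0
        fi
    done < "$numbers_file"
    # Flush the leftover numbers when the total is not a multiple of batch_size
    if [ "$cnt" -gt 0 ]; then
        sed -n "${sed_lines}${last}q" "$source_file"
    fi
)
```

It could be called as extract_batched "${fperp}_cut_sorted" "$forig" >> "$forig.pseudo". Writing sed's output straight to the file also sidesteps the quoting problem from Question 1.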

Answer

What about using awk for this? First store the line numbers in an array, then, while reading the large file, print each line whose line number is in that array:

awk 'FNR==NR{line[$0]=$0; next} FNR in line' line_numbers file

Sample

$ cat line_numbers
8
16
4
6
9
$ cat file
1 hello
2 hello
3 hello
4 hello
5 hello
6 hello
7 hello
8 hello
9 hello
10 hello
11 hello
12 hello
13 hello
14 hello
15 hello
16 hello
17 hello
18 hello
19 hello
20 hello
$ awk 'FNR==NR{line[$0]=$0; next} FNR in line' line_numbers file 
4 hello
6 hello
8 hello
9 hello
16 hello
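Since the file in the question has several million lines, one possible refinement is to stop reading once the largest requested line number has been printed. The sketch below is a variation on the one-liner above, not part of the original answer. The function name extract_lines and the variables wanted and max are illustrative. It keys on $1 rather than $0, on the assumption (taken from the question's read loop) that the line-number file may carry extra tab-separated columns.

```shell
# Sketch only: extract_lines, wanted and max are illustrative names.
# Usage: extract_lines LINE_NUMBERS_FILE LARGE_FILE
extract_lines() {
    awk '
        FNR == NR {                  # first file: remember each wanted line number
            wanted[$1]               # referencing the element is enough to create the key
            if ($1 + 0 > max) max = $1 + 0
            next
        }
        FNR in wanted                # second file: print lines whose number was stored
        FNR >= max { exit }          # nothing left to extract, stop reading
    ' "$1" "$2"
}
```

With the sample files above, extract_lines line_numbers file prints the same five lines as the original command and stops reading after line 16.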