Floran Gmehlin - 1 year ago
Bash Question

Extract multiple lines from large text file with sed while preserving each trailing newline (Bash Script)

I have a large text file of several million lines, from which I need to extract specific lines.

Since I need to extract about 300000 lines (the line numbers to extract are read from a file), I process them in batches of x lines (say 200) to speed up the process, using the following command:

sed '1000p;1002p;2003p;...(200 times)...10001q;d' large_text_file >> extracted.txt

First I construct the string 1000p;1002p;2003p;...(200 times)...10001q;d, then I call the sed command with the string as argument, and repeat this until all lines are processed.

sed_lines="1000p;1002p;2003p;...(200 times)...10001q;d"
sed "$sed_lines" large_text_file >> extracted.txt

The problem I have is that these 200 lines are now treated as one single line, as sed does not keep the \n at the end of each line.

Question 1: Is there an option in sed for preserving the \n at the end of each line ?

Answer 1: OK, I figured this out quickly after writing this post. Basically, I had missed the double quotes around $sentences in the line:

echo $sentences >> $forig.pseudo ==> echo "$sentences" >> $forig.pseudo
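
The effect is easy to reproduce in isolation. Below is a minimal sketch, assuming a hypothetical three-line value stands in for the output of one sed batch: unquoted, the shell word-splits the variable and echo rejoins the words with spaces; quoted, the embedded newlines reach echo intact.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the output of one sed batch.
sentences=$(printf 'first line\nsecond line\nthird line\n')

# Unquoted: the shell splits the value on whitespace and echo
# rejoins the words with single spaces, so the newlines are lost.
unquoted=$(echo $sentences)

# Quoted: the value is passed to echo as a single argument, newlines included.
quoted=$(echo "$sentences")

echo "unquoted: $(printf '%s\n' "$unquoted" | wc -l) line(s)"
echo "quoted:   $(printf '%s\n' "$quoted" | wc -l) line(s)"
```

So sed itself keeps the line breaks; they are only lost when the captured variable is expanded without quotes.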

Question 2: Is there a faster way to do this ?

Answer 2: fedorqui's answer with awk is really fast and efficient.

For the sake of comprehension, here is the bulk of the script that does this process (edited with fedorqui's comment about the while):

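echo "Extracting lines"
sed_lines="" cnt=0 thres_cnt=0
while IFS=$'\t' read -r linenr rest; do
    ((cnt++)) # Batch counter
    if [ "$cnt" -lt 200 ]; then
        sed_lines+="${linenr}p;" # Append line number with a print command
    else
        sed_lines+="${linenr}q;d" # Last of the batch: q prints it and quits, d drops all other lines
        sentences=$(sed "$sed_lines" "$forig") # Extract lines from file
        ((thres_cnt += cnt))
        echo "$thres_cnt lines processed"
        echo "$sentences" >> "$forig.pseudo" # Write lines to new file
        sed_lines="" cnt=0 # Start the next batch
    fi
done < "${fperp}_cut_sorted"
# Flush the final partial batch, if any
[ "$cnt" -gt 0 ] && sed "${sed_lines}d" "$forig" >> "$forig.pseudo"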

Answer Source

What about using awk for this? First store the line numbers in an array, then, while reading the file, check whether the current line number is in that array:

awk 'FNR==NR{line[$0]=$0; next} FNR in line' line_numbers file


$ cat line_numbers
4
6
8
9
16
$ cat file
1 hello
2 hello
3 hello
4 hello
5 hello
6 hello
7 hello
8 hello
9 hello
10 hello
11 hello
12 hello
13 hello
14 hello
15 hello
16 hello
17 hello
18 hello
19 hello
20 hello
$ awk 'FNR==NR{line[$0]=$0; next} FNR in line' line_numbers file 
4 hello
6 hello
8 hello
9 hello
16 hello
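
For a file of several million lines, one possible refinement is to stop reading as soon as the highest requested line has been printed. The following is a sketch rather than part of the answer above; it assumes line_numbers holds one positive integer per line, and the variable names max, workdir and result, as well as the demo data, are illustrative.

```shell
#!/usr/bin/env bash
# Demo data (illustrative): 20 numbered lines and a list of wanted line numbers.
workdir=$(mktemp -d)
seq 1 20 | sed 's/$/ hello/' > "$workdir/file"
printf '%s\n' 4 6 8 9 16 > "$workdir/line_numbers"

# First file: remember each wanted line number and track the largest one.
# Second file: print the wanted lines and exit once the largest is reached.
result=$(awk '
    FNR == NR { line[$0]; if ($0 + 0 > max) max = $0 + 0; next }
    FNR in line
    FNR >= max { exit }
' "$workdir/line_numbers" "$workdir/file")

printf '%s\n' "$result"
rm -r "$workdir"
```

The early exit only helps when the wanted lines end well before the end of the file; otherwise it behaves like the one-liner above.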