Chris Null - 4 months ago
Bash Question

Join and delete lines based on pattern

I have a file with 200,000+ lines. The lines are grouped. Each group begins with a row that starts with "IMAGE", followed by one row that starts with "HISTO" and then at least one, but usually several, rows that start with "FRAG".
I need to:

1. Delete any row that starts with "HISTO".

2. Join each "FRAG" line with the preceding "IMAGE" row.

Here is an example:

>IMAGE ...data1...
>HISTO usually numbers 0 0 1 1 0 1 0
>FRAG ...data1...
>FRAG ...data2...
>IMAGE ...data2...
>HISTO usually numbers 0 0 1 1 0 1 0
>FRAG ...data1...
>FRAG ...data2...
>FRAG ...data3...
>FRAG ...data4...


The result needs to look like this:

>IMAGE ...data1... FRAG ...data1...
>IMAGE ...data1... FRAG ...data2...
>IMAGE ...data2... FRAG ...data1...
>IMAGE ...data2... FRAG ...data2...
>IMAGE ...data2... FRAG ...data3...
>IMAGE ...data2... FRAG ...data4...


There can be many FRAG lines before the next IMAGE line starts a new group. I am on a Mac, so I can use pretty much any tool.

I tried this, but it combines multiple FRAG lines into a single IMAGE line.


awk '/^IMAGE/{if(NR>1)print a; a=$0} /^(FRAG)/{a=a" "$0}' Input.txt > output.txt


That results in this:


IMAGE ...data1... FRAG ...data1... FRAG ...data2...

Answer

This works:

sed 's/^>//' Input.txt | awk '/^IMAGE/{a=$0; next} /^FRAG/{print ">" a, $0}'

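The sed command strips the leading ">" so the awk patterns can match at the start of the line. awk stores each IMAGE line in a and prints it in front of every FRAG line that follows. HISTO lines match neither pattern, so they are never printed. The next statement skips the FRAG check for a line that already matched IMAGE, which saves a little work per line.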
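
The same idea can also be done in a single awk process, without sed. This is only a sketch, assuming the leading ">" is really part of the data and should be kept in the output. The function name join_frags and the variable name image are made up for illustration and are not from the original answer. The attempt in the question fails because it keeps appending every FRAG line to a; here the saved IMAGE line is never modified, and one row is printed per FRAG line.

```shell
# join_frags is a hypothetical helper name, not part of the original answer.
# It reads grouped IMAGE/HISTO/FRAG lines from the given files, or from stdin.
join_frags() {
  awk '
    { sub(/^>/, "") }                  # drop a leading ">" if the line has one
    /^IMAGE/ { image = $0; next }      # remember the current IMAGE row, print nothing
    /^FRAG/  { print ">" image, $0 }   # one output row per FRAG, prefixed by its IMAGE
    # HISTO rows match neither pattern and are silently dropped
  ' "$@"
}
```

It would be called as join_frags Input.txt > output.txt. Lines that do not start with ">" are handled as well, since sub() simply does nothing for them.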