Andrés Chandía - 1 month ago
Bash Question

find and move text between boundaries

I have a huge text file with a collection of texts in this format:

<text id="1">
blah blah blah blah
blah blah
blah
</text>
<text id="2">
blah blah blah blah
blah blah
blah
</text>
<text id="3">


... etc., up to 14,400.

At some point(s) I have this situation:

<text id="XXX">
blah blah blah blah
blah blah
blah
</text>
**text out of bounds**
<text id="XXX">
blah blah blah blah
blah blah


That is, in some places there is text outside the boundaries of the text tags. I need to locate those lines and move them inside the previous block, so that the resulting structure looks like this:

<text id="XXX">
blah blah blah blah
blah blah
blah
**text moved in bounds**
</text>
<text id="XXX">
blah blah blah blah
blah blah


In other words, there must be no text between
</text>
and
<text id="....

Answer

Just don't print the </text line until you see the next <text line or reach the end of the input file:

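$ cat tst.awk
/<\/text/ { end = $0 ORS; next }         # hold the closing tag back instead of printing it
/<text/   { printf "%s", end; end="" }   # next block starts: emit the held closing tag first
{ print }                                # every other line, stray text included, is printed as-is
END { printf "%s", end }                 # flush the last held closing tag at end of input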

$ awk -f tst.awk file
<text id="XXX">
blah blah blah blah
blah blah
blah
**text out of bounds**
</text>
<text id="XXX">
blah blah blah blah
blah blah

That will work in any awk on any OS, and the only memory it needs is enough to hold one pending </text line at a time.
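
Since you also asked how to locate the stray lines, here is a minimal sketch for listing them before you move anything. It assumes, as in your sample, that every <text ...> and </text> tag sits on its own line; the function name find_stray and the sample data are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list non-blank lines that sit outside any
# <text ...> ... </text> block, prefixed with their line number.
find_stray() {
    awk '
        /<text/   { inside = 1; next }   # opening tag: we are now inside a block
        /<\/text/ { inside = 0; next }   # closing tag: we are now outside
        !inside && NF { printf "%d: %s\n", NR, $0 }   # non-blank line outside a block
    ' "$1"
}

# Made-up sample input to try it on
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<text id="1">
blah
</text>
stray line
<text id="2">
blah
</text>
EOF

find_stray "$tmp"
rm -f "$tmp"
```

On that sample it prints "4: stray line". Once you are happy with what it reports, run tst.awk with the output redirected to a new file (awk -f tst.awk file > file.new) and replace the original only after checking the result.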