nonrandom_passer - 4 months ago
Perl Question

Remove text between two substrings (on the same line or across lines) only if it contains a pattern

There is some data (XML) in a file, and I need to remove the text from Substring1 up to Substring2 (both inclusive), but only if that text contains a pattern. Only the matched text should go, not the whole line, so sed's d command does not fit.
My problem is that the formatting varies: Substring1 and Substring2 can be on the same line or on different lines, and there can be several Substring1/Substring2 pairs on the same line.

Example (line 1: two Substring1/Substring2 pairs, the first of which contains PATTERN; line 2: one pair with PATTERN; line 3: one pair without PATTERN; lines 4 and 5: one pair with PATTERN; lines 6 and 7: one pair without PATTERN):

Substring1 =
<?xml


Substring2 =
</update>


Pattern =
PATTERN


tmp.log
<?xml version="1.0" encoding="UTF-8" PATTERN-line1 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update><?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line2 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" PATTERN-line4 <upd_date>2016-03-24</upd_date>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line5 </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <upd_date>2016-03-24</upd_date>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>

Expected output:
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line1 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line3 <upd_date>2016-03-24</upd_date><upd_time>00:01:00.200</upd_time> blah-blah-blah </update>
<?xml version="1.0" encoding="UTF-8" blah-blah-blah-line6 <upd_date>2016-03-24</upd_date>
<upd_time>00:01:00.200</upd_time> blah-blah-blah-line7 </update>


I've tried (without full success) different combinations like the following:

sed -i "s#<?xml.*PATTERN.*</update>##g" tmp.log

sed -i "#<?xml#{p; :a; N; #</update>#!ba; s#.*\n##}; p" tmp.log

perl -pi -e 's/<?xml.*PATTERN.*update>//' tmp.log


As far as I can see, these remove whole lines and miss the case where the substrings are on different lines. They also do not really check that PATTERN lies between the two substrings. Any help is appreciated.

Answer

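With gawk (a regular expression as RS is not guaranteed by POSIX awk), split the input into records at each <?xml and print only the records that do not contain PATTERN:

awk -v RS='<\\?xml' 'NR!=1 && !/PATTERN/ {printf "<?xml%s", $0}' tmp.log

This treats everything from one <?xml up to the next as a single block, so it assumes that nothing but whitespace follows each </update>. printf is used instead of print so that no extra space or newline is added to the output.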