d.ariel d.ariel - 1 month ago 8
MySQL Question

Using awk to replace text between two strings or patterns

I had a question ongoing here but I'm new to stack and somehow the thread got locked or removed: Thread

I'm working with a WordPress database with 60,000 or so "posts" inside the "post_content" column I want to remove those

<p>
html tags and the text in between them. . My post content looks like this:

<p style="text-align: left;"><span style="color: #fffff;">
An entire paragraph of text around 200 words
</span></p>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>


The
p
tags are going to be identical and only occur once per post, the exception is for the color which may possible be different on some posts.

Expected output should be like this:

[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>


I want to remove all the text in the paragraph tags. So what I would like to remove is the text "An entire paragraph of text around 200 words" This text is different on every single post but the one constant is the
<p>
open and close tag.

From my last question this command: By user "PS."

awk '/<p/,/<\/p>/{next} {print $0}' inputfile


Was ran on the .sql database after i had dumped it. But the text was still present after looking at the database.

Any help would be really appreciated.

Update: This question was solved by: Ed Morton

Using GNU awk for multi-char RS this:
awk -v RS='</p>\\s*' -v ORS= '{sub(/<p.*/,"")} 1' file

Answer

Using GNU awk for multi-char RS this:

awk -v RS='</p>\\s*' -v ORS= '{sub(/<p.*/,"")} 1' file

will work whether there's just 1 or multiple <p...</p> pairs in the file, e.g.:

$ cat file
<p style="text-align: left;"><span style="color: #fffff;">
First entire paragraph of text around 200 words
</span></p>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
<p style="text-align: left;"><span style="color: #fffff;">
Second entire paragraph of text around 200 words
</span></p>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>

.

$ awk -v RS='</p>\\s*' -v ORS= '{sub(/<p.*/,"")} 1' file
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="309" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>
[Text_between_brackets]
<iframe src="http://somewebsite.com" width="250" height="250" frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"></iframe>

The above is obviously fragile and can fail if, for example <p can appear in [Text_between_brackets]. The more of the <p... line you can specify in the sub() the less fragile it will be, e.g. MAYBE you can/should do something more like this:

awk -v RS='</p>\\s*' -v ORS= '{sub(/<p style="text-align: left;"><span style="color: /,"")} 1' file
Comments