Ranieri Mazili - 6 months ago
Bash Question

How to get only part of a line using grep/sed/awk with regex?

I have an HTML file from which I need to extract only a specific part. The biggest challenge is that this HTML file has no line breaks, so my grep expression matches more than I want.

Here is my HTML file:

<a href="/link1" param1="data1_1" param2="1_2"><p>Test1</p></a><a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>


Note that I have two anchors (<a>) on this line.

I want to get the second anchor and I was trying to get it using:

cat example.html | grep -o "<a.*Test2</p></a>"


Unfortunately, this command returns the whole line, but I want only:

<a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>


I don't know how to do this with grep or sed. I'd really appreciate any help.

Answer

With GNU awk, which supports a multi-character RS (record separator), if it's the second record you want:

$ awk 'BEGIN{RS="</a>"; ORS=RS"\n"} NR==2' file
<a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>
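If GNU awk is not available, roughly the same result can be had without a multi-character RS by splitting the line with split(). This is only a sketch: the function name nth_anchor and the scratch file are made up for illustration, and it assumes the whole document sits on one line, as in the question.

```shell
# Hypothetical helper: print the Nth <a>...</a> element of a one-line HTML file.
# Usage: nth_anchor N FILE
nth_anchor() {
    awk -v n="$1" '{
        # A separator longer than one character is treated as a regex by split().
        count = split($0, parts, "</a>")
        # The piece after the final </a> is not an anchor, so require n < count.
        if (n >= 1 && n < count) print parts[n] "</a>"
    }' "$2"
}

# Sample input from the question, written to a scratch file.
sample=$(mktemp)
printf '%s\n' '<a href="/link1" param1="data1_1" param2="1_2"><p>Test1</p></a><a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>' > "$sample"

nth_anchor 2 "$sample"
```

Because split() consumes the separator, the closing </a> is appended again when printing, which mirrors what ORS=RS"\n" does in the GNU awk version.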

or if it's the record labeled "Test2":

$ awk 'BEGIN{RS="</a>"; ORS=RS"\n"} /<p>Test2<\/p>/' file
<a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>

or:

$ awk 'BEGIN{RS="</a>"; ORS=RS"\n"; FS="</?p>"} $2=="Test2"' file
<a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>
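
Since the question asked about grep: the original attempt returns the whole line because .* is greedy and the match starts at the first <a on the line. A possible fix is to replace .* with [^>]*, which cannot run past the end of the opening tag. The sketch below assumes each anchor has exactly the shape <a ...><p>label</p></a>, that no attribute value contains >, and that the label has no regex metacharacters; anchor_by_label is a made-up name. The -o option is not in POSIX but is supported by both GNU and BSD grep.

```shell
# Hypothetical helper: print the anchor whose <p> label is exactly $1.
# Usage: anchor_by_label LABEL FILE
anchor_by_label() {
    # [^>]* stays inside the opening <a ...> tag, so the match cannot
    # start at an earlier anchor and swallow everything up to the label.
    grep -o "<a [^>]*><p>$1</p></a>" "$2"
}

# Sample input from the question, written to a scratch file.
sample=$(mktemp)
printf '%s\n' '<a href="/link1" param1="data1_1" param2="1_2"><p>Test1</p></a><a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>' > "$sample"

anchor_by_label Test2 "$sample"
```

For anything beyond this fixed shape (nested tags, attributes containing >), a real HTML parser is a safer choice than a regex.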