Tài Nguyễn - 3 months ago
Bash Question

awk: remove text matching HTML tags

I want to use awk to remove every HTML tag, using this regex:

/[<.*.>]/
wherever the regex matches in any field. I've been trying to make it work with sub or substr, but I can't find the correct logic.

Input text:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation<br/><div style="margin-top:6px">< b>veniam:< /b>< /div> <br/><div style="margin-top:6px">< b>Confort:< /b></div>Comenzi volan; Cruise-control; Servodirectie; <br/>


Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitationveniam: Confort:Comenzi volan; Cruise-control; Servodirectie;

Answer

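If you're not really parsing HTML and just want to remove everything between each <...> pair in a text file, you can do it like this with GNU awk, which supports a multi-character RS. Each tag is treated as a record separator, and the empty ORS joins the remaining text back together: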

$ awk -v RS='<[^>]+>' -v ORS= '1' file
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitationveniam: Confort:Comenzi volan; Cruise-control; Servodirectie;
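
As a side note, the regex in the question, /[<.*.>]/, is a bracket expression, so it matches a single character that is one of <, ., * or >, rather than a whole tag. If GNU awk is not available, a per-line gsub with the same <[^>]+> pattern should work in any POSIX awk. This is only a sketch under the assumption that no tag spans more than one line; the strip_tags function name is made up for illustration.

```shell
# Portable sketch: delete every <...> run on each line, then print the line.
# strip_tags is a hypothetical helper name, not part of the original answer.
# Assumes each tag opens and closes on the same line.
strip_tags() {
    awk '{ gsub(/<[^>]+>/, "") } 1' "$@"
}

# Example: read from stdin (a file name can be passed as an argument instead).
printf '%s\n' 'a<br/><div style="margin-top:6px">< b>b:< /b></div>c' | strip_tags
# prints: ab:c
```

Unlike the RS approach, this keeps the original line breaks and processes one line at a time, so a tag that is split across two lines would be left in place.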