DomainsFeatured - 2 months ago
Linux Question

How To Extract Text Between HTML Tags With Or Condition Multiple Times

I have been researching how to extract title tags from HTML. I've pretty much figured out that regex and HTML don't mix and that grep can be used. However, the code I found here looks like this:

awk -vRS="</title>" '/<title>/{gsub(/.*<title>|\n+/,"");print;exit}'


Now, this works to find the text between title tags only once. I would like to know how I can make it run on every line. I could do a cat file | while read line; do ...; done. However, I know that is probably not very efficient and there's a better way.

Secondly, in the file I need to keep any lines that start with the string '--'. I believe this requires adding an 'or' statement in awk so that it will match the title tags and any line starting with '--'.

The input file would look like this:

text text text <title>random text of the title 1</title> random html stuff
--time--
xyz more random text <title>random text of the title 2</title> hmtl text
--time--
some text <title>random text of the title 3</title> more text tags
--time--
text here <title>random text of the title 4</title> random text html
--time--


The desired output:

<title>random text of the title 1</title>
--time--
<title>random text of the title 2</title>
--time--
<title>random text of the title 3</title>
--time--
<title>random text of the title 4</title>
--time--


I'm not that great with awk, but I'm learning. I know there should be an option to print all, but it's the OR statement that I'm really stuck on. I am open to sed or grep if you think that's more efficient. Any help or direction is greatly appreciated.

Answer

For your given input, grep is enough:

$ grep -o '<.*>\|^--.*' ip.html 
<title>random text of the title 1</title>
--time--
<title>random text of the title 2</title>
--time--
<title>random text of the title 3</title>
--time--
<title>random text of the title 4</title>
--time--
  • -o prints only the matching part of each line
  • <.*> matches from the first < up to the last > in the line
  • \|^--.* is the alternative pattern: if the line starts with --, match the whole line
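
The "up to the last >" part matters once a line holds more than one tag. A small sketch of what the first pattern returns in that case; the input line here is made up for illustration and is not from the question:

```shell
# Made-up input line with an extra tag before the title
line='a <b>x</b> <title>t</title> z'

# .* is greedy, so the match runs from the first < to the last >
printf '%s\n' "$line" | grep -o '<.*>'
# prints: <b>x</b> <title>t</title>
```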

To restrict the match to title tags only:

grep -o '<title.*title>\|^--.*' ip.html
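
Since the question asked about an 'or' condition in awk, here is a rough awk sketch of the same idea. In awk the "or" can simply be two separate pattern-action rules. The function name extract_titles is made up for this example, and the sketch assumes each line has at most one title and that the title text contains no < character:

```shell
# Hypothetical helper: print each <title>...</title> span and every line starting with --
extract_titles() {
    awk '
        /^--/ { print; next }                       # keep marker lines unchanged
        match($0, /<title>[^<]*<\/title>/) {        # otherwise look for a title element
            print substr($0, RSTART, RLENGTH)       # print only the matched span
        }
    ' "$1"
}
```

Calling extract_titles ip.html should give the same output as the grep commands for the sample input, and lines with neither a title nor a leading -- are dropped.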