Swastik Udupa - 9 months ago
Bash Question

Extract text between the two anchor tags using sed, grep or awk

<div class="plot_summary minPlotHeightWithPoster">
<div class="summary_text" itemprop="description">
King Leonidas of Sparta and a force of 300 men fight the Persians at Thermopylae in 480 B.C.

I want to extract the text between the two div tags. I am new to sed and awk, so I couldn't figure out how to do that. I tried using grep, but without success.

Answer Source

Recommended method to parse XML or HTML at a Unix or Unix-like terminal:

If you are looking for a way to do this from the Unix command line, I suggest first considering an XML parsing tool instead of awk, grep, or sed.

For example, most systems have xmllint. If your HTML is contained in the file index.html, the following xmllint command extracts the text:

xmllint --xpath "//div[contains(@class, 'plot_summary')]/div[contains(@class, 'summary_text')]/text()" index.html

The output of that command still contains leading whitespace and blank lines, so you'd probably pipe it to another command to trim it:

xmllint --xpath "//div[contains(@class, 'plot_summary')]/div[contains(@class, 'summary_text')]/text()" index.html | sed -e 's/^[[:space:]]*//' -e '/^[[:space:]]*$/d'

The sed command we are piping the output to has two expressions. The first, 's/^[[:space:]]*//', deletes whitespace at the beginning of each line, and the second, '/^[[:space:]]*$/d', deletes any lines that contain only whitespace.
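
To see what those two expressions do without needing xmllint, here is a minimal sketch that feeds them some made-up padded input. The function name trim_text and the sample text are my own, chosen only for illustration; the sed expressions are the ones from the pipeline above.

```shell
#!/bin/sh
# trim_text is a hypothetical helper name; it just wraps the two sed expressions.
trim_text() {
    sed -e 's/^[[:space:]]*//' -e '/^[[:space:]]*$/d'
}

# Made-up input: an empty line, an indented line of text, and a whitespace-only line.
result=$(printf '\n    King Leonidas of Sparta\n   \n' | trim_text)

# Only the text line survives, with its leading spaces removed.
printf '%s\n' "$result"
```

Note that neither expression touches whitespace at the end of a line, so trailing spaces would be kept.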

There are other command-line XML parsing tools you can look into (see the accepted answer): How to execute XPath one-liners from shell?
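
If you want to try the xmllint pipeline above end to end, here is a rough sketch. It makes a few assumptions that are not in the question: the two closing </div> tags are added so the sample parses as well-formed XML, the sample is written to a temporary file instead of index.html, and the variable names (sample, xpath, summary) are just placeholders. It also assumes your xmllint build supports --xpath, and it skips the query if xmllint is not installed.

```shell
#!/bin/sh
# Assumption: the closing </div> tags below are added so the sample is
# well-formed XML; the snippet in the question does not show them.
sample=$(mktemp "${TMPDIR:-/tmp}/index.XXXXXX")
cat > "$sample" <<'EOF'
<div class="plot_summary minPlotHeightWithPoster">
<div class="summary_text" itemprop="description">
King Leonidas of Sparta and a force of 300 men fight the Persians at Thermopylae in 480 B.C.
</div>
</div>
EOF

# Same XPath query as in the answer.
xpath="//div[contains(@class, 'plot_summary')]/div[contains(@class, 'summary_text')]/text()"

if command -v xmllint >/dev/null 2>&1; then
    # Extract the text node, then trim it with the two sed expressions.
    summary=$(xmllint --xpath "$xpath" "$sample" |
        sed -e 's/^[[:space:]]*//' -e '/^[[:space:]]*$/d')
    printf '%s\n' "$summary"
else
    summary=""
    echo "xmllint not found; skipping the query" >&2
fi

# Clean up the temporary sample file.
rm -f "$sample"
```

Real-world pages are often not well-formed XML. In that case xmllint's --html option, which switches to its HTML parser, is probably worth trying, possibly with 2>/dev/null to hide parser warnings.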