Sledge Hammer - 4 months ago
HTML Question

Sed Replace Links to Specific Domains with the Anchor Text

I have a lot of static HTML files containing links for various domains.

I need to replace only the links to specific domains with their anchor text.

Here's the command I managed to come up with so far:

sed 's|<a .*example\.com.*>\(.*\)<\/a>|\1|Ig' file.html


And here's an example of how it should work:

This

<p>Some random text <a href="http://example.com/sample_page" title="Example Title">Anchor Text</a> | Some other random text <a href="http://example.org/">Different Anchor Text</a></p>


Should become this:

<p>Some random text Anchor Text | Some other random text <a href="http://example.org/">Different Anchor Text</a></p>


The command above works great when there's only one link per line, but with more than one it removes all of them regardless of domain, leaving only the last one's anchor text.

I've found a few similar topics here but couldn't adapt any of the solutions to my problem. Of course, it's entirely possible that I missed an existing topic with a solution I haven't tried. Let me know if I haven't explained the problem clearly enough or have left out some important info.

//EDIT:

After replacing the second .* with [^>]*, so that the command looks like this:

sed 's|<a .*example\.com[^>]*>\(.*\)<\/a>|\1|Ig' file.html


the first closing </a> remains, and the closing </a> of the last link is removed instead.

Here's an example result:

<p>Some random text Anchor Text</a> | Some other random text <a href="http://example.org/">Different Anchor Text</p>


Replacing .* with [^<>]* yields the same result.

Answer

Note that . matches any character, angle brackets included, and that * is greedy, so .* runs past the end of the tag to the last place on the line where the rest of the pattern can still match.
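To see the greedy behaviour in isolation, here is a minimal sketch on a made-up one-line input. The sample line and the variable names are assumptions for illustration, not taken from the question.

```shell
# Hypothetical sample: two links on one line, only the first points at example.com.
line='<a href="http://example.com/a">One</a> and <a href="http://example.org/">Two</a>'

# With greedy .* the match starts at the first <a and stretches to the last </a>,
# so the capture group ends up holding only the last link's anchor text.
greedy=$(printf '%s\n' "$line" | sed 's|<a .*example\.com.*>\(.*\)</a>|\1|')
printf '%s\n' "$greedy"
```

The whole line is consumed by a single match, which is why both links disappear and only the text of the last one survives.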

You can "temper" the . with a negated character class such as [^>]:

sed 's|<a [^>]*example\.com[^>]*>\([^>]*\)</a>|\1|Ig' file.html

This means that there can be no > inside the opening a tag or inside the anchor text. As a literal > can appear in the contents you are dealing with, I guess a safer, though a bit slower, alternative is to use [^<]* for the anchor text (as < should always be written as an entity).
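As a rough sketch of that [^<] alternative, the following runs the adjusted command against the sample line from the question. It assumes GNU sed (the I flag is a GNU extension), and the variable names are only illustrative.

```shell
# Sample line copied from the question.
html='<p>Some random text <a href="http://example.com/sample_page" title="Example Title">Anchor Text</a> | Some other random text <a href="http://example.org/">Different Anchor Text</a></p>'

# [^>]* keeps the match inside a single opening tag;
# [^<]* keeps the captured anchor text from running into the next tag.
result=$(printf '%s\n' "$html" | sed 's|<a [^>]*example\.com[^>]*>\([^<]*\)</a>|\1|Ig')
printf '%s\n' "$result"
```

Only the example.com link is reduced to its anchor text; the example.org link is left untouched. Note that this pattern looks for example.com anywhere inside the opening tag, so it would also catch the string in a title attribute or in a longer host name, and it will not match links whose anchor text contains nested tags.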