ohadinho ohadinho - 7 months ago 59
HTML Question

Scrape anchor (<a>) HTML tags

I need to scrape <a> tags in HTML.

My goal is to scrape tags that have valid links in their href attribute.

I think I'm very close to the answer. This is the regex I wrote:

<a .*href=("|').*\.asp("|').*?>.*?<\/a>


http://regexr.com/3d989

FIRST ISSUE:

Result:

<a id='topnavbtn_tutorials' href='javascript:void(0);' onclick='w3_open_nav("tutorials")' title='Tutorials'>TUTORIALS <i class='fa fa-caret-down'></i><i class='fa fa-caret-up' style='display:none'></i></a><a id='topnavbtn_references' href='javascript:void(0);' onclick='w3_open_nav("references")' title='References'>REFERENCES <i class='fa fa-caret-down'></i><i class='fa fa-caret-up' style='display:none'></i></a><a id='topnavbtn_examples' href='javascript:void(0);' onclick='w3_open_nav("examples")' title='Examples'>EXAMPLES <i class='fa fa-caret-down'></i><i class='fa fa-caret-up' style='display:none'></i></a><a href='/forum/default.asp'>FORUM</a>


and I only need:

<a href='/forum/default.asp'>FORUM</a>


SECOND ISSUE:

Result:

<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a><a href='/css/default.asp' class='w3-hide-small' title='CSS Tutorial'>CSS</a><a href='/js/default.asp' class='w3-hide-small' title='JavaScript Tutorial'>JAVASCRIPT</a><a href='/sql/default.asp' class='w3-hide-small' title='SQL Tutorial'>SQL</a><a href='/php/default.asp' class='w3-hide-small' title='PHP Tutorial'>PHP</a><a href='/bootstrap/default.asp' class='w3-hide-small' title='Bootstrap Tutorial'>BOOTSTRAP</a><a href='/jquery/default.asp' class='w3-hide-small' title='jQuery Tutorial'>JQUERY</a><a href='/angular/default.asp' class='w3-hide-small' title='Angular Tutorial'>ANGULAR</a><a href='/xml/default.asp' class='w3-hide-small' title='XML Tutorial'>XML</a>


and I need them as separate results:

<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a>

<a href='/css/default.asp' class='w3-hide-small' title='CSS Tutorial'>CSS</a>

<a href='/js/default.asp' class='w3-hide-small' title='JavaScript Tutorial'>JAVASCRIPT</a>


and so on...

Answer

Updated. See below.

If you have the HTML in string form, you can do something like this:

// Split the string into one chunk per anchor tag.
// Nested anchor tags are invalid HTML, so this seems feasible:
var anchorArray = str.replace(/><a/g, '>¶<a').split('¶'); // ¶ is a placeholder to split on

var matches = [];
// No g flag here: a global regex remembers lastIndex between test() calls,
// which makes repeated test() calls on different strings skip matches.
var re = /<a .*href=["'].*\.asp["'].*?>.*?<\/a>/;

// Separate the anchors with actual links from the rest of the HTML.
// filter() returns a new array, so its result has to be kept.
var remaining = anchorArray.filter(function (element) {
    if (re.test(element)) {
        matches.push(element); // keep the match in an array (2nd condition)
        return false;
    }
    return true;
});

var returnedHTML = remaining.join('');  // HTML w/o actual links (1st condition)
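The pattern in the question returns several anchors in one result because .* is greedy and can run across tag boundaries. As a rough sketch, restricting each wildcard to characters that cannot end the opening tag or the quoted href value keeps every match inside a single anchor, without splitting the string first. The function name extractAspAnchors and the sample string are made up for illustration, and the pattern assumes that attribute values contain no ">" and that anchors are not nested.

```javascript
// Hypothetical helper: returns every <a>...</a> whose href value ends in ".asp".
// Assumes attribute values contain no ">" and anchors are not nested.
function extractAspAnchors(html) {
    // [^>]*?   stays inside the opening tag
    // [^"']*   stays inside the quoted href value
    // \1       requires the same quote character that opened the value
    // [\s\S]*? lazily consumes the anchor body up to the nearest </a>
    var re = /<a\s[^>]*?href=(["'])[^"']*\.asp\1[^>]*>[\s\S]*?<\/a>/gi;
    return html.match(re) || [];
}

var sample =
    "<a id='nav' href='javascript:void(0);' onclick='w3_open_nav(\"tutorials\")'>TUTORIALS <i class='fa'></i></a>" +
    "<a href='/forum/default.asp'>FORUM</a>" +
    "<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a>";

var anchors = extractAspAnchors(sample);
console.log(anchors.length); // 2
console.log(anchors[0]);     // <a href='/forum/default.asp'>FORUM</a>
```

Each element of the returned array is one complete anchor, so the javascript:void(0) entries are skipped and adjacent .asp links come back as separate results.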

Note that the preferred means of parsing HTML is not with regex, but with an HTML parser.
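For comparison, here is a minimal sketch of the parser route, assuming the code runs in a browser where DOMParser is available. The helper name aspAnchorsFromDocument and the sample markup are hypothetical. It relies on the standard CSS "ends with" attribute selector rather than a regex.

```javascript
// Hypothetical helper: takes a parsed Document (or any element) and returns
// the outer HTML of each anchor whose href attribute ends in ".asp".
function aspAnchorsFromDocument(doc) {
    // a[href$=".asp"] is a standard CSS attribute selector ("ends with").
    var nodes = doc.querySelectorAll('a[href$=".asp"]');
    return Array.prototype.map.call(nodes, function (a) {
        return a.outerHTML;
    });
}

// Browser-only usage: DOMParser does not exist in plain Node.js,
// so the call is guarded.
if (typeof DOMParser !== 'undefined') {
    var html =
        "<a href='javascript:void(0);'>MENU</a>" +
        "<a href='/forum/default.asp'>FORUM</a>";
    var doc = new DOMParser().parseFromString(html, 'text/html');
    // The serialized markup may differ from the input in details such as quoting.
    console.log(aspAnchorsFromDocument(doc));
}
```

Because the browser does the parsing, attribute order, quoting style, and whitespace between tags stop mattering, which is where the regex approaches tend to break.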
