user2472083 - 1 year ago 58
PHP Question

Trouble parsing just img src from RSS feed?

I am trying to create an RSS reader based on this example:

Specifically, I am attempting to modify this example so that the reader will access and display all the available comic images (and nothing else) from any given web comic RSS feed. I realize that it may be necessary to make the code at least a little site-specific, but I am trying to make it as general-purpose as possible. Currently, I have modified the initial example to produce a reader that displays all the comics from a given list of RSS feeds. However, it also displays other unwanted text that I am trying to get rid of. Here is my code so far, with a few feeds that are giving me trouble in particular:

index.php file:

    function showRSS()
    {
        if (window.XMLHttpRequest)
        { // code for IE7+, Firefox, Chrome, Opera, Safari
            xmlhttp=new XMLHttpRequest();
        } else
        { // code for IE6, IE5
            xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
        }
        xmlhttp.onreadystatechange=function()
        {
            if (xmlhttp.readyState==4 && xmlhttp.status==200)
            {
                document.getElementById("rssOutput").innerHTML=xmlhttp.responseText;
            }
        }
    }
    <body onload="showRSS()">
    <div id="rssOutput"></div>

(pretty sure there's nothing wrong with this file; I think the problems arise in the next one although I included this one for completeness)



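    //function to get all comics from an rss feed
    function getComics($xml)
    {
        $xmlDoc = new DOMDocument();
        $xmlDoc->load($xml);

        //every comic is one item element in the feed
        $x = $xmlDoc->getElementsByTagName('item');
        foreach ($x as $x)
        {
            $comic_image = $x->getElementsByTagName('description')->item(0)->nodeValue;
            //output the comic
            echo ($comic_image . "</p>");
            echo ("<br>");
        }
    }

    //create array of all RSS feed URLs
    $URLs = array(
        "SMBC" => "",
        "garfieldMinusGarfield" => "",
        "babyBlues" => "",
    );

    //Loop through all RSS feeds
    foreach ($URLs as $xml)
    {
        getComics($xml);
    }
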

Because this method includes extra text in between the comic images (a lot of random stuff with SMBC, just a few advertisement links for gMg, and a copyright link for baby blues), I looked at the RSS feeds and concluded that the problem is that it's the description tag that includes the image source, but also includes other stuff. Next, I tried modifying the getComics function to scan directly for the image tag, rather than first looking for the description tag. I replaced the part in between the DOMDocument creation/loading and the URL list with:


    $images = $xmlDoc->getElementsByTagName('img');
    print_r($images);
    foreach ($images as $image)
    {
        //echo $image->item(0)->getAttribute('src');
        echo $image->item(0)->nodeValue;
        echo ("<br>");
    }

but apparently getElementsByTagName doesn't pick up the image tag embedded inside the description tag, because I get no comic images outputted, and the following output from the print_r statement:

DOMNodeList Object ( [length] => 0 ) DOMNodeList Object ( [length] => 0 )

Finally, I tried a combination of the two methods, trying to use getElementsByTagName('img') inside the code that parses out the description tag contents. I replaced the line:




But this also finds nothing, producing the output:

DOMNodeList Object ( [length] => 0 )

So sorry for the really long background, but I'm wondering if there is a way to parse just the img src out of a given RSS feed without the other text and links I don't want?

Help would be much appreciated

Answer Source

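The description content is escaped in the feed, so the DOM sees it as plain text rather than as child elements, which is why getElementsByTagName('img') finds nothing. Decode the description and load it into its own DOMDocument first; the following code should work: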

foreach ($x as $y) {
    $description = $y->getElementsByTagName('description')->item(0);
    $decoded_description = htmlspecialchars_decode($description->nodeValue);
    $description_xml = new DOMDocument();
    //parse the decoded HTML so the img tag becomes a real element
    $description_xml->loadHTML($decoded_description);
    $comic_image = $description_xml->getElementsByTagName('img')->item(0)->getAttribute('src');

    //output the comic
    echo ($comic_image);
    echo ("<br>");
}
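
If you want to try the idea outside the reader, here is a minimal, self-contained sketch. The function name extractComicImages, the sample feed, and the example.com URLs are made up for illustration and are not from the question. It assumes each item's description holds an HTML fragment, either entity-escaped or wrapped in CDATA, and it collects every img src rather than only the first one.

```php
<?php
// Hypothetical helper (not from the original post): returns the src of every
// <img> found inside the <description> of each <item> in an RSS string.
function extractComicImages(string $rssXml): array
{
    $feed = new DOMDocument();
    $feed->loadXML($rssXml);

    $sources = [];
    foreach ($feed->getElementsByTagName('item') as $item) {
        $description = $item->getElementsByTagName('description')->item(0);
        if ($description === null) {
            continue; // item without a description
        }

        // textContent undoes the XML escaping (or unwraps CDATA), leaving
        // the HTML fragment as an ordinary string.
        $html = trim($description->textContent);
        if ($html === '') {
            continue; // nothing to parse
        }

        // Parse the fragment as its own HTML document. Feed HTML is often
        // sloppy, so collect libxml warnings instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $fragment = new DOMDocument();
        $fragment->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($fragment->getElementsByTagName('img') as $img) {
            if ($img->hasAttribute('src')) {
                $sources[] = $img->getAttribute('src');
            }
        }
    }

    return $sources;
}

// Made-up sample feed: one escaped description, one CDATA description.
$sampleFeed = <<<'XML'
<rss version="2.0"><channel><title>Sample</title>
<item><title>One</title><description>&lt;p&gt;Ad text&lt;/p&gt;&lt;img src="http://example.com/comic1.png" /&gt;&lt;a href="http://example.com/"&gt;link&lt;/a&gt;</description></item>
<item><title>Two</title><description><![CDATA[<img src="http://example.com/comic2.png"> Copyright notice]]></description></item>
</channel></rss>
XML;

// Output only the images, dropping the surrounding text and links.
foreach (extractComicImages($sampleFeed) as $src) {
    echo '<img src="' . htmlspecialchars($src) . '"><br>' . "\n";
}
```

On this sample feed, calling getElementsByTagName('img') on the feed document itself returns an empty list, because the img markup only exists as text inside description. To use it with a real feed you could pass in the result of file_get_contents() on the feed URL.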