user2472083 - 3 months ago
PHP Question

Trouble parsing just img src from RSS feed?

I am trying to create an RSS reader based on this example:

http://www.w3schools.com/php/php_ajax_rss_reader.asp

Specifically, I am attempting to modify this example so that the reader will access and display all the available comic images (and nothing else) from any given web comic RSS feed. I realize that it may be necessary to make the code at least a little site-specific, but I am trying to make it as general-purpose as possible. Currently, I have modified the initial example to produce a reader that displays all the comics from a given list of RSS feeds. However, it also displays other unwanted text that I am trying to get rid of. Here is my code so far, with a few feeds that are giving me trouble in particular:

index.php file:

<html>
<head>
<script>
function showRSS()
{
if (window.XMLHttpRequest)
{
// code for IE7+, Firefox, Chrome, Opera, Safari
xmlhttp=new XMLHttpRequest();
} else
{ // code for IE6, IE5
xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
}
xmlhttp.onreadystatechange=function()
{
if (xmlhttp.readyState==4 && xmlhttp.status==200)
{
document.getElementById("rssOutput").innerHTML=xmlhttp.responseText;
}
}
xmlhttp.open("GET","logger.php",true);
xmlhttp.send();
}
</script>
</head>
<body onload="showRSS()">
<div id="rssOutput"></div>
</body>
</html>


(pretty sure there's nothing wrong with this file; I think the problems arise in the next one although I included this one for completeness)

logger.php:

<?php

//function to get all comics from an rss feed
function getComics($xml)
{
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);

$x=$xmlDoc->getElementsByTagName('item');
foreach ($x as $x)
{
$comic_image=$x->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;
//output the comic
echo ($comic_image . "</p>");
echo ("<br>");
}

}

//create array of all RSS feed URLs
$URLs =
[
"SMBC" => "http://www.smbc-comics.com/rss.php",
"garfieldMinusGarfield" => "http://garfieldminusgarfield.net/rss",
"babyBlues" => "http://www.comicsyndicate.org/Feed/Baby%20Blues",
];

//Loop through all RSS feeds
foreach ($URLs as $xml)
{
getComics($xml);
}

?>


Because this method includes extra text in between the comic images (a lot of random stuff with SMBC, just a few advertisement links for gMg, and a copyright link for baby blues), I looked at the RSS feeds and concluded that the problem is that it's the description tag that includes the image source, but also includes other stuff. Next, I tried modifying the getComics function to scan directly for the image tag, rather than first looking for the description tag. I replaced the part in between the DOMDocument creation/loading and the URL list with:

$images=$xmlDoc->getElementsByTagName('img');
print_r($images);

foreach ($images as $image)
{
//echo $image->item(0)->getAttribute('src');
echo $image->item(0)->nodeValue;
echo ("<br>");
}


but apparently getElementsByTagName doesn't pick up the image tag embedded inside the description tag, because no comic images are output, and the print_r statement produces the following:

DOMNodeList Object ( [length] => 0 ) DOMNodeList Object ( [length] => 0 )


Finally, I tried a combination of the two methods, calling getElementsByTagName('img') inside the code that parses out the description tag contents. I replaced the line:

$comic_image=$x->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;


with:

$comic_image=$x->getElementsByTagName('description')->item(0)->getElementsByTagName('img');
print_r($comic_image);


But this also finds nothing, producing the output:

DOMNodeList Object ( [length] => 0 )


So sorry for the really long background, but I'm wondering if there is a way to parse just the img src out of a given RSS feed without the other text and links I don't want?

Help would be much appreciated

Answer

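Inside the feed, the content of each description tag is escaped text rather than real markup, so the img tag never becomes an element in the feed's DOM. That is why getElementsByTagName('img') finds nothing. Decode the description, load it into a second DOMDocument as HTML, and read the src attribute from there. The following code should work:

foreach ($x as $y) {
    $description = $y->getElementsByTagName('description')->item(0);
    $decoded_description = htmlspecialchars_decode($description->nodeValue);
    $description_xml = new DOMDocument();
    // @ suppresses the warnings loadHTML emits for malformed feed HTML
    @$description_xml->loadHTML($decoded_description);
    $image = $description_xml->getElementsByTagName('img')->item(0);
    if ($image === null) {
        continue; // this item has no image, skip it
    }
    $comic_image = $image->getAttribute('src');

    //output the comic's image URL
    echo ($comic_image);
    echo ("<br>");
}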
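
To see the whole idea in one place, here is a minimal, self-contained sketch. It assumes PHP 7 or later. The helper name extractImageSources, the inlined sample feed, and the example.com URLs are all made up for illustration; they are not from the feeds in the question. It also skips htmlspecialchars_decode, on the assumption that the feed is escaped only once, since nodeValue already returns entity-decoded text in that case. Treat it as a starting point rather than a drop-in replacement for logger.php:

```php
<?php
// Hypothetical helper (not part of the original answer): returns the src of
// every <img> found in an HTML fragment such as an RSS <description> value.
function extractImageSources(string $html): array
{
    // loadHTML rejects empty input, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    // Feed HTML is often malformed; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

// Made-up sample feed, inlined so the sketch runs without network access.
$rss = <<<XML
<rss version="2.0"><channel>
  <item>
    <title>Escaped example</title>
    <description>&lt;p&gt;Ad text&lt;/p&gt;&lt;img src="http://example.com/comic1.png"&gt;</description>
  </item>
  <item>
    <title>CDATA example</title>
    <description><![CDATA[<img src="http://example.com/comic2.png" /><a href="#">copyright</a>]]></description>
  </item>
</channel></rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($rss);

// The markup is plain text inside <description>, so the feed's own DOM has no <img> elements.
$imgCountInFeed = $feed->getElementsByTagName('img')->length;

$comicSources = [];
foreach ($feed->getElementsByTagName('item') as $item) {
    $description = $item->getElementsByTagName('description')->item(0);
    if ($description === null) {
        continue; // item without a description
    }
    // nodeValue is already entity-decoded by the XML parser.
    foreach (extractImageSources($description->nodeValue) as $src) {
        $comicSources[] = $src;
    }
}

// Emit only the images, dropping the surrounding text and links.
foreach ($comicSources as $src) {
    echo '<img src="' . htmlspecialchars($src, ENT_QUOTES) . '"><br>' . "\n";
}
```

To use this against real feeds, you would replace the inlined string with $feed->load($url) for each entry in the URL list. Collecting every img rather than only the first is a design choice: some feeds put more than one image in a description, so a site-specific filter may still be needed.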