Xahed Kamal - 4 months ago
PHP Question

Scrape Images, Links and Texts serially using Goutte

I have the code below, which tries to take the html elements one by one, in order, including the tag itself but without any styles and classes. I'm also failing to get images.


$client = new Client();

$crawler = $client->request('GET', 'http://www.tutorialspoint.com/laravel/laravel_ajax.htm');

$crawler->filter('h1, h2, h3, h4, h5, h6, p, pre, p > img, div > img, p > a')->each(function(Crawler $node, $i){
if ($node->filter('p')){
echo $node->text()."<br/>";

} else if ($node->filter('pre')) {
echo '<code>'.$node->html().'</code><br/>';
}
});


But whatever I do, I get either only the text when I use $node->text(), or all of the HTML in that page when I use $node->html().

I'm trying to get, for example:

p - <p>Text Here</p>
img - <img src="default.jp"/>

Answer

The expression $node->filter('p') is always truthy, because filter() returns a Crawler object even when nothing matches, so the else if branch is never reached.
To check whether a crawler actually contains any nodes, use its count() method.
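
To illustrate the difference, here is a minimal sketch that runs the Crawler against an inline HTML string instead of a live page. It assumes symfony/dom-crawler and symfony/css-selector are installed through Composer (Goutte depends on both); the sample markup and the variable names are made up for the illustration.

```php
<?php
// Assumption: Composer autoloader with symfony/dom-crawler and symfony/css-selector.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up sample markup; a <pre> element that contains no <p>.
$crawler = new Crawler('<html><body><p>Hello</p><pre>code</pre></body></html>');
$pre = $crawler->filter('pre');

// filter() returns a Crawler object, and an object is truthy even when it is empty.
$looksTruthy = (bool) $pre->filter('p');

// count() reports how many nodes actually matched.
$matches = $pre->filter('p')->count();

var_dump($looksTruthy); // truthiness of the returned object
var_dump($matches);     // number of matched nodes
```

So a condition written as `if ($node->filter('p')->count() > 0)` would only pass when a matching descendant exists.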

As for your code: I'm not sure this is what you intended, but it checks whether the current element has a <p> descendant (is that what you are trying to do?) and, if it does, prints the text of the current node.

To get the underlying DOMElement from the Crawler ($node), you can use

$node->getNode(0)

With this node you can check the nodeName (the tag name), read the textContent (the text inside the tag), and so on.

Here is an example you can use:

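use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();
$crawler = $client->request('GET', 'http://www.tutorialspoint.com/laravel/laravel_ajax.htm');

$crawler->filter('h1, h2, h3, h4, h5, h6, p, pre, p > img, div > img, p > a')->each(function (Crawler $node, $i) {
    // The underlying DOMElement exposes nodeName, textContent and getAttribute().
    $element = $node->getNode(0);

    if (in_array($element->nodeName, ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'a'])) {
        // Headings, paragraphs and links: print the tag name and its text.
        echo "{$element->nodeName} => {$element->textContent}.<br/>\n";
    } elseif ($element->nodeName == 'pre') {
        // Preformatted blocks: print the inner HTML wrapped in <code>.
        echo "pre => <code>".$node->html()."</code><br/>\n";
    } elseif ($element->nodeName == 'img') {
        // Images have no text, so print the src attribute instead.
        echo 'img => src="'.$element->getAttribute('src')."\" <br/>\n";
    }
});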
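
The question also asks for each tag itself, without styles or classes. One possible approach, sketched here with PHP's built-in DOM extension only, is to clone the DOMElement, drop every attribute that is not on a keep list, and serialize the clone. The helper name outerHtmlWithoutAttributes, its $keep parameter, and the sample markup are hypothetical and not part of Goutte or the Crawler.

```php
<?php
// Hypothetical helper: outer HTML of an element with all attributes removed,
// except those named in $keep (src and href by default).
function outerHtmlWithoutAttributes(DOMElement $element, array $keep = ['src', 'href']): string
{
    // Work on a deep copy so the original document is left untouched.
    $copy = $element->cloneNode(true);

    // Collect the copy itself plus all of its descendant elements.
    $elements = [$copy];
    foreach ($copy->getElementsByTagName('*') as $descendant) {
        $elements[] = $descendant;
    }

    foreach ($elements as $el) {
        // Snapshot the attribute list first, because it changes as attributes are removed.
        foreach (iterator_to_array($el->attributes, false) as $attribute) {
            if (!in_array($attribute->nodeName, $keep, true)) {
                $el->removeAttributeNode($attribute);
            }
        }
    }

    // Serialize only the cleaned copy, tag included.
    return $element->ownerDocument->saveHTML($copy);
}

// Made-up sample markup for the demonstration.
$doc = new DOMDocument();
$doc->loadHTML('<p class="intro" style="color:red">Text <a href="/x" class="link">Here</a></p>');
$p = $doc->getElementsByTagName('p')->item(0);

echo outerHtmlWithoutAttributes($p), "\n";
```

Inside the each() callback, the same helper could be called as outerHtmlWithoutAttributes($node->getNode(0)), since getNode(0) returns the DOMElement that the Crawler wraps.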