PHP Question

PHP recursion with the results put into a single array

I am trying to program a web crawler, but I have no idea how to create a recursion that parses a webpage and adds all the end results into a single final array.
I have never worked with PHP before, but I did a lot of research on the internet and already figured out how to parse the page I want to scrape.

Please note that I have changed the $url value and the array result below to values I made up.

<?php
include_once "simple_html_dom.php"; // http://simplehtmldom.sourceforge.net/

$url = "https://www.scrapesite.com/pagetoscrape/index.html";

// Download a page and return the decoded JSON data embedded in its second <script> tag.
function parseLink($link) {
    $html = file_get_html($link);
    $html = $html->find("/html/body/script[2]/text", 0);
    preg_match('/\{(?:[^{}]|(?R))*\}/', $html, $matches); // this regex extracts a JSON object
    $json = json_decode($matches[0]);
    return $json->props->contents;
}

// Build a list of (type, absolute path, href) entries for the folders on one page.
function getFolders($basepath, $data) {
    $result = array();

    foreach ($data->folders as $value) {
        $result[] = array("folder", $basepath . "/" . $value->filename, $value->href);
    }

    return $result;
}

$data = getFolders("", parseLink($url));
print_r($data);

?>


This script works fine and it outputs the following:

Array
(
    [0] => Array
        (
            [0] => folder
            [1] => /1
            [2] => https://www.scrapesite.com/pagetoscrape/sjdfi327943sad/index.html
        )

    [1] => Array
        (
            [0] => folder
            [1] => /2
            [2] => https://www.scrapesite.com/pagetoscrape/345fdsjjsdfsdf/index.html
        )

    [2] => Array
        (
            [0] => folder
            [1] => /3
            [2] => https://www.scrapesite.com/pagetoscrape/46589dsjodsiods/index.html
        )

    [3] => Array
        (
            [0] => folder
            [1] => /4
            [2] => https://www.scrapesite.com/pagetoscrape/345897dujfosfsd/index.html
        )

    [4] => Array
        (
            [0] => folder
            [1] => /5
            [2] => https://www.scrapesite.com/pagetoscrape/9dsfghshdfsds3/index.html
        )

)



Now the script should execute the getFolders function for every item in the above array. That may return another array of folders, which should be parsed too.
In the end I want a final array that lists the ABSOLUTE paths of all folders ($basepath . "/" . $value->filename) together with their href links.

I really appreciate every little hint.
I was able to find some examples on the web, but I can't figure out how to apply them here because I have almost no experience with programming languages in general.

Answer

Initialize an empty array and pass it by reference to the getFolders() function. Keep putting the scraping results into this array. You also need to call getFolders() again inside the foreach loop of getFolders(). Example below:

$finalResults = array();
getFolders("", parseLink($url), $finalResults);

Your getFolders() function signature will now look like this:

function getFolders($basepath, $data, &$finalResults) //notice the & before the $finalResults used for passing by reference

And, your foreach loop:

foreach ($data as $value) {
    $path = $basepath . "/" . $value->filename;
    $finalResults[] = array("folder", $path, $value->href);
    getFolders($path, parseLink($value->href), $finalResults); // pass the accumulated path so nested folders get absolute paths
}

The above code is just an example; adapt it to your needs.
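
For reference, here is a minimal sketch of how the pieces could fit together, assuming the parseLink() function from your question works for every folder's href (i.e. each linked page embeds JSON with the same structure) and that pages at the bottom of the tree simply have an empty folders list, which is what makes the recursion stop:

<?php
include_once "simple_html_dom.php"; // http://simplehtmldom.sourceforge.net/

// parseLink() as defined in the question: downloads a page and returns
// the decoded JSON data from its second <script> tag.
function parseLink($link) {
    $html = file_get_html($link);
    $html = $html->find("/html/body/script[2]/text", 0);
    preg_match('/\{(?:[^{}]|(?R))*\}/', $html, $matches);
    $json = json_decode($matches[0]);
    return $json->props->contents;
}

// Collect every folder of $data and of all nested pages into $finalResults.
function getFolders($basepath, $data, &$finalResults) {
    foreach ($data->folders as $value) {
        $path = $basepath . "/" . $value->filename;

        // Record the absolute path and href of this folder.
        $finalResults[] = array("folder", $path, $value->href);

        // Recurse into the folder's own page with the accumulated path.
        getFolders($path, parseLink($value->href), $finalResults);
    }
}

$url = "https://www.scrapesite.com/pagetoscrape/index.html";

$finalResults = array();
getFolders("", parseLink($url), $finalResults);

print_r($finalResults);
?>

If the same page could appear under more than one folder, you might also want to keep a list of already-visited hrefs and skip them, so the crawler does not fetch the same page twice.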