Usman Shahid - 2 months ago
PHP Question

How to make this crawler more efficient

I built this web crawler.

https://github.com/shoutweb/WebsiteCrawlerEmailExtractor

//Regular expression function that scans an individual page for emails
function get_emails_from_webpage($url)
{
    $text = file_get_contents($url);
    $res = preg_match_all("/[a-z0-9]+[_a-z0-9\.-]*[a-z0-9]+@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})/i", $text, $matches);
    if ($res) {
        return array_unique($matches[0]);
    } else {
        return null;
    }
}

//URL array
$URLArray = array();

//Inputted URL; right now it is pulled from a GET variable, but you can alter this any way you want
$inputtedURL = $_GET['url'];

//Crawling the inputted domain to collect its URLs
$urlContent = file_get_contents("http://" . urldecode($inputtedURL));
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

$scrapedEmails = array();

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $url = filter_var($url, FILTER_SANITIZE_URL);

    //Keep only valid URLs that belong to the inputted domain
    if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
        if (strpos($url, $inputtedURL) !== false) {
            array_push($URLArray, $url);
        }
    }
}

//Extracting the emails from the URLs that were crawled
foreach ($URLArray as $url) {
    $emails = get_emails_from_webpage($url);

    if ($emails != null) {
        foreach ($emails as $email) {
            if (!in_array($email, $scrapedEmails)) {
                array_push($scrapedEmails, $email);
            }
        }
    }
}

//Outputting the scraped emails in addition to the number of URLs crawled
foreach ($scrapedEmails as $value) {
    echo $value . " " . count($URLArray);
}


It basically goes to a domain that you enter, collects all of its pages, and then checks each page for email addresses.

Each domain can take up to 30 seconds to crawl. I want to see if there is a way to speed up this web crawler. One way I was thinking of was to limit it to contact pages only, but I couldn't figure out a smart way of doing that.
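For illustration, a naive keyword filter over the crawled URLs might look something like this (treating contact or about in the path as a hint is only an assumption about how such pages are usually named):

function looks_like_contact_page($url)
{
    //Assumed keywords; adjust to whatever naming the target sites actually use
    $path = strtolower((string) parse_url($url, PHP_URL_PATH));
    foreach (array('contact', 'about') as $keyword) {
        if (strpos($path, $keyword) !== false) {
            return true;
        }
    }
    return false;
}

//Only crawl pages that look like contact pages
$contactURLs = array_filter($URLArray, 'looks_like_contact_page');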

Answer

Provided your intentions are not nefarious:

As mentioned in the comments, one way to achieve this is to execute the crawler in parallel (multiple processes) rather than crawling one domain at a time.

Something like:

exec('php crawler.php > /dev/null 2>&1 &');
exec('php crawler.php > /dev/null 2>&1 &');
exec('php crawler.php > /dev/null 2>&1 &');
exec('php crawler.php > /dev/null 2>&1 &');
exec('php crawler.php > /dev/null 2>&1 &');
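
Alternatively (or additionally), the page fetches within a single run can be parallelized using PHP's curl_multi API instead of sequential file_get_contents calls. A minimal sketch, with illustrative names, assuming the email regex is factored out to run on raw HTML:

//Fetch several URLs concurrently with curl_multi (sketch)
function fetch_urls_parallel(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    //Run all transfers concurrently until every handle has finished
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    //Collect each page's HTML, keyed by URL
    $results = array();
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $results;
}

Each returned HTML string can then be run through the same preg_match_all pattern, so the crawler spends its time parsing rather than waiting on the network.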

On the server, you can set up a cron job that runs this automatically, so that you are not launching it manually.
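
For example, a crontab entry along these lines (the path is hypothetical) would launch the crawler every hour:

0 * * * * php /path/to/crawler.php > /dev/null 2>&1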