Dave Dave - 21 days ago 6
PHP Question

curl returns 404 on valid page

I've got a PHP function that checks a URL to make sure that (a.) there's some kind of server response, and (b.) it's not a 404.

It works just fine on every domain/URL I've tested, with the exception of bostonglobe.com, where it's returning a 404 for valid URLs. I'm guessing it has something to do with their paywall, but my function works fine on nytimes.com and other newspaper sites.

Here's an example URL that returns a 404:

https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html

What am I doing wrong?

function check_url($url) {
    $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
    $curl = curl_init($url);
    // CURLOPT_NOBODY skips the response body
    curl_setopt($curl, CURLOPT_NOBODY, true);
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    // Store and resend cookies on the same handle ($curl, not an undefined $ch)
    curl_setopt($curl, CURLOPT_COOKIEJAR, 'cookies.txt');
    curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookies.txt');
    $result = curl_exec($curl);
    if ($result === false) {
        // There was no response
        $message = "No information found for that URL";
    } else {
        // What was the response?
        $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        if ($statusCode == 404) {
            $message = "No information found for that URL";
        } else {
            $message = "Good";
        }
    }
    return $message;
}

Answer

The problem seems to come from your CURLOPT_NOBODY option.

I've tested your code both with and without this line: the HTTP status code is 404 when CURLOPT_NOBODY is present, and 200 when it's not.

The PHP manual states that setting the CURLOPT_NOBODY option changes the request method to HEAD. My guess is that the server hosting bostonglobe.com doesn't support that method.
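
One possible workaround, if HEAD requests can't be trusted for a given site, is to try HEAD first and repeat the check with a normal GET whenever the HEAD result looks inconclusive. The following is only a minimal sketch under that assumption: the function names status_to_message, fetch_status, and check_url_with_fallback are hypothetical, the 10-second timeout is an arbitrary choice, and treating 404 and 405 from a HEAD request as "inconclusive" is my own guess at a reasonable rule, not something the server documents.

```php
<?php
// Hypothetical helper: maps a status code to the messages used in the question.
function status_to_message(int $statusCode): string
{
    // 0 stands for "no HTTP response at all"; 404 means the page was not found.
    if ($statusCode === 0 || $statusCode === 404) {
        return "No information found for that URL";
    }
    return "Good";
}

// Hypothetical helper: returns the HTTP status code for $url,
// or 0 if cURL could not complete the request.
function fetch_status(string $url, bool $headOnly): int
{
    $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
    $curl = curl_init($url);
    // true sends a HEAD request; false leaves the default GET in place.
    curl_setopt($curl, CURLOPT_NOBODY, $headOnly);
    // Return the body as a string instead of echoing it.
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    // Arbitrary timeout (assumption) so a slow server cannot hang the check.
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);

    $result = curl_exec($curl);
    if ($result === false) {
        return 0;
    }
    return (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
}

// Tries HEAD first, then falls back to GET when HEAD looks unreliable.
function check_url_with_fallback(string $url): string
{
    $statusCode = fetch_status($url, true);
    // Assumption: a failed request, 404, or 405 on HEAD is worth re-checking with GET.
    if ($statusCode === 0 || $statusCode === 404 || $statusCode === 405) {
        $statusCode = fetch_status($url, false);
    }
    return status_to_message($statusCode);
}
```

Calling check_url_with_fallback($url) in place of check_url($url) keeps the cheap HEAD request for servers that handle it and only downloads the body when needed. If bandwidth is not a concern, simply removing the CURLOPT_NOBODY line from the original function is the simpler fix.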