Mostafa Ibrahim - 2 months ago
PHP Question

Laravel: fetching an RSS feed with Guzzle returns a JavaScript page

I am trying to fetch an RSS feed using the code below.

<?php

$client = new \GuzzleHttp\Client(['User-Agent' => 'idap']);
$content = $client->request('GET', 'alarabiya.net/.mrss/ar.xml');

dd($content->getBody()->getContents());


It returns the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n
<html>\n
<head>\n
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\n
<meta http-equiv="Content-Script-Type" content="text/javascript">\n
<script type="text/javascript">\n
function getCookie(c_name) { // Local function for getting a cookie value\n
if (document.cookie.length > 0) {\n
c_start = document.cookie.indexOf(c_name + "=");\n
if (c_start!=-1) {\n
c_start=c_start + c_name.length + 1;\n
c_end=document.cookie.indexOf(";", c_start);\n
\n
if (c_end==-1) \n
c_end = document.cookie.length;\n
\n
return unescape(document.cookie.substring(c_start,c_end));\n
}\n
}\n
return "";\n
}\n
function setCookie(c_name, value, expiredays) { // Local function for setting a value of a cookie\n
var exdate = new Date();\n
exdate.setDate(exdate.getDate()+expiredays);\n
document.cookie = c_name + "=" + escape(value) + ((expiredays==null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/";\n
}\n
function getHostUri() {\n
var loc = document.location;\n
return loc.toString();\n
}\n
setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '46.252.205.139', 10);\n
try { \n
location.reload(true); \n
} catch (err1) { \n
try { \n
location.reload(); \n
} catch (err2) { \n
\tlocation.href = getHostUri(); \n
} \n
}\n
</script>\n
</head>\n
<body>\n
<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser.</noscript>\n
</body>\n
</html>\n


How can I get the RSS feed from the https://www.alarabiya.net/.mrss/ar.xml link? Also, many sites do not give the full description in their RSS. How can I get the complete description in code, like fivefilters.org does? And what if the RSS file is big and takes a long time to load?

Thanks,

Answer

I have updated my answer to use GuzzleHttp\Client. I have tested this code myself, and it works with Guzzle version ^6.2. Use Composer to install that specific version to be safe. I assume you know how to get the code below up and running with Composer.

Description

When we request the RSS feed at http://www.alarabiya.net/.mrss/ar.xml, the server first looks for a cookie tied to the IP address the request comes from. If it does not find one, it returns an HTML page whose JavaScript sets a cookie of the form Cookie_Hash=IP. The part of the code that sets the cookie is:

setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '49.49.242.64', 10);

Once the cookie is set, the JavaScript reloads the page. Because the cookie is now set for that IP, the second request succeeds and the complete RSS feed is sent to the browser.

You can read the full JavaScript source in the response shown in the question to see where all of this happens. The request headers that need to be sent with our Guzzle request can be copied from the request headers the browser sends, using the developer tools in Chrome or Firefox.
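The cookie step described above can be tried offline before involving any HTTP client. The following is a minimal sketch, assuming the challenge page always contains a call of the form setCookie('name', 'value', days); the helper name extractChallengeCookie and the trimmed sample page are hypothetical and only for illustration.

```php
<?php

/**
 * Extract the cookie name and value from the JavaScript challenge page.
 * Hypothetical helper for illustration; it is not part of Guzzle.
 *
 * Returns [name, value], or null if no setCookie('...', '...') call is found.
 */
function extractChallengeCookie(string $html): ?array
{
    // Match the call setCookie('<name>', '<value>', ...). The function
    // definition is skipped because its first argument is not quoted.
    if (preg_match("/setCookie\(\s*'([^']+)'\s*,\s*'([^']+)'/", $html, $m) !== 1) {
        return null;
    }

    return [$m[1], $m[2]];
}

// Trimmed sample of the challenge page (assumed shape, based on the question).
$sample = <<<'HTML'
<script type="text/javascript">
function setCookie(c_name, value, expiredays) { /* ... */ }
setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '49.49.242.64', 10);
location.reload(true);
</script>
HTML;

[$name, $value] = extractChallengeCookie($sample);

// Value to send in the Cookie header of the second request.
$cookieHeader = $name . '=' . $value;
echo $cookieHeader, PHP_EOL;
```

Returning null when nothing matches lets the caller tell "the server sent the challenge page" apart from "the server already sent the feed", instead of relying on fixed match positions.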

Let me know if anything is unclear.

<?php

require_once 'vendor/autoload.php';

$client = new \GuzzleHttp\Client([
    'base_uri' => 'http://www.alarabiya.net/',
    'cookies' => true,
]);

$res = $client->request('GET', '/.mrss/ar.xml');

$firstResponse = $res->getBody();

// Search the challenge page for the call that sets the cookie, e.g.
// setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '49.49.242.64', 10);
// The function definition is not matched because its first argument is unquoted.
$pattern = "/setCookie\(\s*'([^']+)'\s*,\s*'([^']+)'/";

if (!preg_match($pattern, (string) $firstResponse, $matches)) {
    exit('No setCookie() call found in the first response.');
}

$cookieName  = $matches[1]; // YPF8827340282Jdskjhfiw_928937459182JAX666
$cookieValue = $matches[2]; // 49.49.242.64

// Repeat the request, this time sending the header "Cookie: $cookieName=$cookieValue"

$res = $client->request('GET', '/.mrss/ar.xml', [
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' .
            '(KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,' .
            'image/webp,*/*;q=0.8',
        'Accept-Encoding' => 'gzip, deflate, sdch',
        'Cookie' => ["$cookieName=$cookieValue"],
        'Referer' => 'http://www.alarabiya.net/.mrss/ar.xml',
        'Upgrade-Insecure-Requests' => 1,
        'Connection' => 'keep-alive',
    ],
    // 'debug' => false, // Set to true for debugging
]);

echo $res->getBody();

Note: I have tested this code with "guzzlehttp/guzzle": "^6.2".
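Because the first request returns an HTML page rather than the feed, it may help to check that the final response really parses as RSS before using it. Below is a minimal sketch, assuming the SimpleXML extension is enabled and the feed follows the usual rss/channel/item layout; the helper name rssItemTitles and the sample feed are hypothetical.

```php
<?php

/**
 * Return the item titles of an RSS document, or null if the body is not
 * well-formed XML containing a <channel> element (for example, when the
 * HTML challenge page came back instead of the feed).
 * Hypothetical helper for illustration.
 */
function rssItemTitles(string $body): ?array
{
    // Collect libxml parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($body);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false || !isset($xml->channel)) {
        return null;
    }

    $titles = [];
    foreach ($xml->channel->item as $item) {
        $titles[] = (string) $item->title;
    }

    return $titles;
}

// Made-up sample feed, used only to exercise the helper.
$sampleFeed = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>First item</title></item>
    <item><title>Second item</title></item>
  </channel>
</rss>
XML;

$titles  = rssItemTitles($sampleFeed);
$notFeed = rssItemTitles('<html><head><meta charset="utf-8"></head></html>');

var_dump($titles, $notFeed);
```

In the script above, the same check could be applied to (string) $res->getBody() in place of the sample string.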