N. Smeding - 1 month ago
Javascript Question

Crawling a page returns only about 3% of the content

I am trying to crawl a full section of a website, but the data I need is not in the page when it first loads. Is there any way to get that data from the website with PHP?

This is the link: https://www.iamsterdam.com/nl/uit-in-amsterdam/uit/agenda and this is the section I need:


After my post was marked as a duplicate, I tried this answer: http://stackoverflow.com/a/28506533/7007968, but it doesn't work either, so I need another solution. This is what I tried:

get-website.php

$phantom_script= 'get-website.js';


$response = exec ('phantomjs ' . $phantom_script);

echo $response;


get-website.js

var webPage = require('webpage');
var page = webPage.create();

page.open('https://www.iamsterdam.com/nl/uit-in-amsterdam/uit', function(status) {
console.log(page.content);
phantom.exit();
});


This is all I get back (around 3% of the page):

</div><div id="ads"></div><script src="https://analytics.twitter.com/i/adsct?p_id=Twitter&amp;p_user_id=0&amp;txn_id=nvk6a&amp;events=%5B%5B%22pageview%22%2Cnull%5D%5D&amp;tw_sale_amount=0&amp;tw_order_quantity=0&amp;tw_iframe_status=0&amp;tpx_cb=twttr.conversion.loadPixels" type="text/javascript"></script></body></html>


I have the feeling that I am getting closer. This is what I came up with after a lot of searching:

var webPage = require('webpage');
var page = webPage.create();
var settings = {
operation: "POST",
encoding: "utf8",
headers: {
"Content-Type": "application/json"
},
data: JSON.stringify({
DateFilter: 04112016,
LastMinuteTickets: 0,
PageId: "3418a37d-b907-4c80-9d67-9fec68d96568",
Skip: 0,
Take: 12,
ViewMode: 1
})
};

page.open('https://www.iamsterdam.com/api/AgendaApi/', settings, function(status) {
console.log(page.content);
phantom.exit();
});


But what I get back doesn't look good:

Message":"An error has occurred.","ExceptionMessage":"Page could not be found","ExceptionType":"System.ApplicationException","StackTrace":" at Axendo.SC.AM.Iamsterdam.Controllers.Api.AgendaApiController.GetResultsInternal(RequestModel requestModel)\r\n at lambda_method(Closure , Object , Object[] )\r\n


etc.

I hope someone can help me.

Answer

To address your main question about the 3%: you are using exec incorrectly. When it is called like this

$response =  exec ('phantomjs ' . $phantom_script);

$response will contain only the last line of what the command printed to the terminal. Because you called console.log(page.content);, the last line of the HTML document ended up in the $response variable.

The correct use of exec would be

exec ('phantomjs ' . $phantom_script, $response);

This way the output is placed into the $response variable as an array, with each line as one element. Then, if you just want the HTML as a single string, you can do

$html = implode("\n", $response);
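To see the difference between the return value and the output array without having PhantomJS installed, here is a minimal sketch. It assumes PHP is run from the command line, and it swaps the phantomjs call for a stand-in command (a child PHP process that prints three lines), which is purely illustrative. The variable names $lines, $last and $exitCode are my own.

```php
<?php
// Stand-in for 'phantomjs get-website.js': a child PHP process printing 1, 2, 3
// on separate lines. Assumes a CLI run, where PHP_BINARY is the php executable.
$command = escapeshellarg(PHP_BINARY) . ' -r '
    . escapeshellarg('echo implode(PHP_EOL, range(1, 3));');

// exec() appends to an array that already has elements, so start with an empty one.
$lines = [];
$last = exec($command, $lines, $exitCode);

// $last is only the final line ("3"); $lines has every line ("1", "2", "3").
$html = implode("\n", $lines);

echo "return value: $last\n";
echo "line count: " . count($lines) . "\n";
echo "exit code: $exitCode\n";
```

The third argument to exec receives the command's exit status, which is worth checking before trusting the output.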

But a simpler way is to use the function made for this task:

passthru ('phantomjs ' . $phantom_script);

passthru executes a command and passes the received data unmodified, straight to the output.

So if you want to capture it in a variable, do:

ob_start();
passthru ('phantomjs ' . $phantom_script);
$html = ob_get_clean();
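The buffered version does not tell you whether the command actually succeeded. passthru accepts an optional second argument that receives the command's exit status, so one possible variant wraps the three lines in a helper and checks it. This is only a sketch: runAndCapture is a hypothetical name, and the stand-in command (a child PHP process, assuming a CLI run) is there just so the example runs without PhantomJS.

```php
<?php
/**
 * Runs a command and returns everything it wrote to standard output.
 * runAndCapture is a hypothetical helper name, not a PHP built-in.
 */
function runAndCapture(string $command): string
{
    ob_start();
    passthru($command, $exitCode);   // raw output lands in the buffer
    $output = ob_get_clean();

    if ($exitCode !== 0) {
        throw new RuntimeException("Command failed with exit code $exitCode");
    }
    return $output;
}

// With PhantomJS installed, the call would look like this:
// $html = runAndCapture('phantomjs ' . escapeshellarg('get-website.js'));

// Stand-in command so the sketch runs anywhere PHP's CLI is available.
$command = escapeshellarg(PHP_BINARY) . ' -r '
    . escapeshellarg('echo implode(PHP_EOL, range(1, 3));');
$html = runAndCapture($command);
echo $html;
```

Passing the script name through escapeshellarg is a precaution that matters mostly when the name comes from a variable rather than a literal.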