Brett Powell - 2 months ago
PHP Question

Find Tables by ID using Simple HTML DOM Parser

I wrote a database seeder last year that scrapes a stats website. Upon revisiting my code, it no longer seems to be working and I am a bit stumped as to the reason.

$html->find() is supposed to return an array of the elements found, but it seems to find only the first table.

Following the documentation, I then tried calling find() with each table's ID instead, but this also fails.

$table_passing = $html->find('table[id=passing]');


Can anyone help me figure out what is wrong here? I am at a loss as to why neither method works: the page source clearly shows multiple tables with these IDs, so both approaches should work.

private function getTeamStats()
{
    $url = 'http://www.pro-football-reference.com/years/2016/opp.htm';
    $html = file_get_html($url);

    $tables = $html->find('table');

    $table_defense = $tables[0];
    $table_passing = $tables[1];
    $table_rushing = $tables[2];

    //$table_passing = $html->find('table[id=passing]');

    $teams = array();

    # OVERALL DEFENSIVE STATISTICS #
    foreach ($table_defense->find('tr') as $row)
    {
        $stats = $row->find('td');

        // Ignore the lines that don't have ranks, these aren't teams
        if (isset($stats[0]) && !empty($stats[0]->plaintext))
        {
            $name = $stats[1]->plaintext;
            $rank = $stats[0]->plaintext;
            $games = $stats[2]->plaintext;
            $yards = $stats[4]->plaintext;

            // Calculate the Yards Allowed per Game by dividing Total / Games
            $tydag = $yards / $games;

            $teams[$name]['rank'] = $rank;
            $teams[$name]['games'] = $games;
            $teams[$name]['tydag'] = $tydag;
        }
    }

    # PASSING DEFENSIVE STATISTICS #
    foreach ($table_passing->find('tr') as $row)
    {
        $stats = $row->find('td');

        // Ignore the lines that don't have ranks, these aren't teams
        if (isset($stats[0]) && !empty($stats[0]->plaintext))
        {
            $name = $stats[1]->plaintext;
            $pass_rank = $stats[0]->plaintext;
            $pass_yards = $stats[14]->plaintext;

            $teams[$name]['pass_rank'] = $pass_rank;
            $teams[$name]['paydag'] = $pass_yards;
        }
    }

    # RUSHING DEFENSIVE STATISTICS #
    foreach ($table_rushing->find('tr') as $row)
    {
        $stats = $row->find('td');

        // Ignore the lines that don't have ranks, these aren't teams
        if (isset($stats[0]) && !empty($stats[0]->plaintext))
        {
            $name = $stats[1]->plaintext;
            $rush_rank = $stats[0]->plaintext;
            $rush_yards = $stats[7]->plaintext;

            $teams[$name]['rush_rank'] = $rush_rank;
            $teams[$name]['ruydag'] = $rush_yards;
        }
    }

    return $teams;
}

Answer

I never use simplexml or its derivatives, but in an XPath query an attribute such as id is usually prefixed with @ and its value is quoted, so in your case it might be:

$table_passing = $html->find('table[@id="passing"]');
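For comparison, this is roughly how that attribute predicate behaves with PHP's built-in DOMXPath rather than Simple HTML DOM. It is only a sketch: the markup in $sample is invented for illustration and is not the real page.

```php
<?php
// Hypothetical markup standing in for a page with several tables.
$sample = '<html><body>'
    . '<table id="team_stats"><tr><td>1</td></tr></table>'
    . '<table id="passing"><tr><td>2</td></tr></table>'
    . '<table id="rushing"><tr><td>3</td></tr></table>'
    . '</body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Attribute names take an @ prefix and the value is quoted.
$matches = $xpath->query('//table[@id="passing"]');

// query() returns a DOMNodeList, so pick the first node explicitly.
$table = $matches->item(0);

echo $matches->length, ' match: ', $table->getAttribute('id'), PHP_EOL;
```

Whichever library is used, a lookup by id returns a collection here, so the single table has to be pulled out of it before its rows can be searched.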

Using a standard DOMDocument and DOMXPath approach, I found that the actual table is "commented out" in the page source, so the parser never sees it as an element. A simple string replacement that strips the HTML comment markers allowed the following to work, and it could easily be adapted to the original code.

$url = 'http://www.pro-football-reference.com/years/2016/opp.htm';

$html = file_get_contents($url);
/* remove the html comments */
$html = str_replace(array('<!--', '-->'), '', $html);

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->validateOnParse = false;
$dom->standalone = true;
$dom->strictErrorChecking = false;
$dom->recover = true;
$dom->formatOutput = false;
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
$tbl = $xp->query('//table[@id="passing"]');

foreach ($tbl as $n) {
    echo $n->tagName . ' > ' . $n->getAttribute('id');
}

/* outputs: table > passing */
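To see why stripping the comment markers matters, and how the result might feed loops like the ones in the question, here is a small self-contained sketch. The markup in $sample and the helper name tableRows() are made up for illustration; the real page has many more columns, so the cell indexes would need adjusting.

```php
<?php
// Hypothetical markup: the second table is wrapped in an HTML comment,
// mimicking what is described above for the real page.
$sample = '<html><body>'
    . '<table id="team_stats"><tr><td>1</td><td>Team A</td></tr></table>'
    . '<!-- <table id="passing">'
    . '<tr><th>Rk</th><th>Tm</th></tr>'
    . '<tr><td>1</td><td>Team B</td></tr>'
    . '<tr><td>2</td><td>Team C</td></tr>'
    . '</table> -->'
    . '</body></html>';

/**
 * Hypothetical helper: return the text of every <td> in each row of the
 * table with the given id, skipping rows that have no <td> cells.
 * The id is assumed to be a trusted literal, not user input.
 */
function tableRows(string $html, string $id): array
{
    libxml_use_internal_errors(true);   // tolerate sloppy real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows  = [];

    foreach ($xpath->query('//table[@id="' . $id . '"]//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }

    return $rows;
}

// Commented-out markup is parsed as a comment node, so no table is found.
$before = tableRows($sample, 'passing');

// After removing the comment markers the same query sees the table.
$after = tableRows(str_replace(['<!--', '-->'], '', $sample), 'passing');

echo count($before), ' rows before, ', count($after), ' rows after', PHP_EOL;
```

Note that str_replace() removes every comment marker on the page, which is blunt: anything else that was commented out becomes live markup too.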