Stephan Tips - 5 months ago
PHP Question

Display first 4 columns of external table

I am using Windows software to organize a tourpool. This program creates (among other things) HTML pages with rankings of participants. But these HTML pages are quite hideous, so I am building a site around it.

To show the top 10 ranking, I need to select the first 10 of about 1000 participants in the generated HTML file and put them on my own site.

To do this, I used:

// get top 10 ranks of p_rnk.htm
$file_contents = file_get_contents('p_rnk.htm');
$start = strpos($file_contents, '<tr class="header">');

// get end
$i = 11;
while (strpos($file_contents, '<tr><td class="position">'. $i .'</td>', $start) === false){
    $i++;
}

$end = strpos($file_contents, '<td class="position">'. $i .'</td>', $start);

// substr() takes a length as its third argument, not an end offset
$code = substr($file_contents, $start, $end - $start);
echo $code;


This works, but the last 3 columns (previous position, up or down, and details) are useless information. So I want to delete these columns, or find a way to select and display only the first 4.

How do I manage this?




EDIT



I managed to get the first 10 rows out of the HTML file, but I can't seem to get the useless td's removed.

Here is my code so far:

<?php

$DOM = new DOMDocument;
$DOM->loadHTMLFile("p_rnk.htm");

$body = $DOM->getElementsByTagName('body')->item(0);
$tables = $body->getElementsByTagName('table');
for ($i = 0; $i < 2; $i++){
    $body->removeChild($tables->item(0)); // Line 186
}

$table = $DOM->getElementsByTagName('table')->item(0);
$rows = $table->getElementsByTagName('tr');

$cut_rows_after = 10;
$cut_colomns_after = 3;

$row_index = $rows->length-1;

while($row = $rows->item($row_index)) {
    if($row_index+1 > $cut_rows_after)
        $table->removeChild($row);
    else {
        $tds = $row->getElementsByTagName('td');
        $colomn_index = $tds->length-1;
        while($td = $tds->item($colomn_index)) {
            if($colomn_index+1 > $cut_colomns_after)
                $row->removeChild($td);
            $colomn_index--;
        }
    }
    $row_index--;
}

echo $DOM->saveHTML();

?>





HTML FILE





<html>
<head>
<title>Pool</title>
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Cache-Control" content="no-cache">
<link rel="stylesheet" href="addStyle.css">
</head>

<body>

<center>
<!-- Table 1 -->
<table>
<tr>
<td class="pageheader">EINDSTAND</td>
</tr>
</table>

</center>

<!-- Table 2 -->
<table class="transparent">
<tr class="transparent">
<td class="transparent" style="vertical-align: top"></td>
<td class="transparent" align="center" width="100%">
<img src="i_general.gif" align="left">

<!-- Table 3 -->
<table align="center">
<tr class="header">
<td class="position">#</td>
<td class="participant">Deelnemer</td>
<td class="score">Punten</td>
<td class="previousscore">Vorig</td> <!-- TO BE DELETED -->
<td class="updown"><img src="i_stdal.gif"></td> <!-- TO BE DELETED -->
<td class="details">Details</td> <!-- TO BE DELETED -->
</tr>
<tr>
<td class="position">1</td>
<td class="participant">Dummy1</td>
<td class="score">1153</td>
<td class="previousscore">(2) 801 + 352</td> <!-- TO BE DELETED -->
<td class="updown"><img src="i_stijg.gif"></td> <!-- TO BE DELETED -->
<td class="details"> <!-- TO BE DELETED -->
<a href="p_dtls694.htm" class="tablelink">Details</a>
</td>
</tr>
<tr>
<td class="position">2</td>
<td class="participant">Dummy2</td>
<td class="score">1153</td>
<td class="previousscore">(2) 801 + 352</td> <!-- TO BE DELETED -->
<td class="updown"><img src="i_stijg.gif"></td> <!-- TO BE DELETED -->
<td class="details"> <!-- TO BE DELETED -->
<a href="p_dtls694.htm" class="tablelink">Details</a>
</td>
</tr>
<tr>
<td class="position">3</td>
<td class="participant">Dummy3</td>
<td class="score">1153</td>
<td class="previousscore">(2) 801 + 352</td> <!-- TO BE DELETED -->
<td class="updown"><img src="i_stijg.gif"></td> <!-- TO BE DELETED -->
<td class="details"> <!-- TO BE DELETED -->
<a href="p_dtls694.htm" class="tablelink">Details</a>
</td>
</tr>
<!-- etc... -->
</table>
</td>
</tr>
</table>


</body>
</html>




Answer

I'd say the best way to deal with this is to parse the HTML document (see, for instance, the first answer here) and then manipulate the resulting DOM object. That way you can easily extract the table itself using various selectors, get your first 10 records more simply, and remove the unnecessary child (td) nodes from each row (using removeChild). When you're done modifying, dump the resulting HTML using saveHTML.
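As a minimal sketch of the "extract the table using selectors" step, assuming the ranking table is the one whose first row has class `header` (as in the question's sample), `DOMXPath` can pick it out directly. The helper name `findRankingTable` and the inline sample markup are illustrative only and not part of the original code.

```php
<?php
// Hypothetical helper: returns the nearest <table> around <tr class="header">,
// or null if the document has no such row.
function findRankingTable(DOMDocument $dom): ?DOMElement
{
    $xpath = new DOMXPath($dom);
    // ancestor::table[1] is the closest enclosing table, so an outer layout
    // table wrapped around the ranking table is not picked by mistake.
    $node = $xpath->query('//tr[@class="header"]/ancestor::table[1]')->item(0);
    return $node instanceof DOMElement ? $node : null;
}

// Tiny stand-in for p_rnk.htm (assumed structure, modelled on the question).
$html = '<html><body><table><tr><td class="pageheader">EINDSTAND</td></tr></table>'
      . '<table align="center"><tr class="header"><td class="position">#</td></tr>'
      . '<tr><td class="position">1</td></tr></table></body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$table = findRankingTable($dom);
if ($table !== null) {
    // saveHTML() accepts a node, so only the table is printed, not the whole page.
    echo $dom->saveHTML($table);
}
```

Printing only the selected node also avoids having to remove the other tables from the document first.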

Update:

OK, here is tested code. Instead of hardcoding the numbers of columns and rows inside the loops, I put them in two variables so that you can adjust them if needed. Look at the code closely and you'll notice some details that were missing in your code: indexes run 0..999, not 1..1000, which is why all the -1s and +1s appear; it is better to decrease the index than to increase it, because then you don't have to care about the numbering shifting when nodes are removed; and I used while instead of for so that the case $rows->item($row_index) == null needs no separate handling:

<?php
    $DOM = new DOMDocument;
    $DOM->loadHTMLFile("./table.html");

    // Container that directly holds the rows
    $table = $DOM->getElementsByTagName('tbody')->item(0);
    $rows = $table->getElementsByTagName('tr');

    $cut_rows_after = 10;   // number of rows to keep
    $cut_colomns_after = 4; // number of columns to keep

    // Walk backwards so that removing a node does not shift the indexes still to visit
    $row_index = $rows->length-1;
    while($row = $rows->item($row_index)) {
        if($row_index+1 > $cut_rows_after)
            // Row lies beyond the limit: drop it entirely
            $table->removeChild($row);
        else {
            // Row is kept: drop only the cells beyond the column limit
            $tds = $row->getElementsByTagName('td');
            $colomn_index = $tds->length-1;
            while($td = $tds->item($colomn_index)) {
                if($colomn_index+1 > $cut_colomns_after)
                    $row->removeChild($td);
                $colomn_index--;
            }
        }
        $row_index--;
    }

    echo $DOM->saveHTML();
?>

Update 2:

If the page doesn't contain tbody, use the container which is present. For instance, if tr elements are inside a table element, use $DOM->getElementsByTagName('table') instead of $DOM->getElementsByTagName('tbody').
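Applied to a table without tbody, the same approach might look like the sketch below. It assumes the header row is an ordinary `tr` (so it counts toward the row limit), that the table contains no nested tables, and that only direct `td`/`th` children of a row count as columns. `trimTable` and its parameters are hypothetical names, and the inline markup is a shortened stand-in for the real file.

```php
<?php
// Hypothetical helper: keep the first $maxRows rows and the first $maxCols cells per row.
function trimTable(DOMElement $table, int $maxRows, int $maxCols): void
{
    // iterator_to_array() takes a snapshot, so removing nodes cannot shift the list.
    $rows = iterator_to_array($table->getElementsByTagName('tr'), false);
    foreach ($rows as $rowIndex => $row) {
        if ($rowIndex >= $maxRows) {
            // Remove via parentNode so this works with or without a tbody wrapper.
            $row->parentNode->removeChild($row);
            continue;
        }
        // Collect the cells first; whitespace text nodes between them are skipped.
        $cells = [];
        foreach ($row->childNodes as $child) {
            if ($child instanceof DOMElement && in_array($child->tagName, ['td', 'th'], true)) {
                $cells[] = $child;
            }
        }
        foreach (array_slice($cells, $maxCols) as $cell) {
            $row->removeChild($cell);
        }
    }
}

// Shortened stand-in for the ranking table (assumed structure).
$html = '<table><tr class="header"><td>#</td><td>Deelnemer</td><td>Punten</td><td>Vorig</td></tr>'
      . '<tr><td>1</td><td>Dummy1</td><td>1153</td><td>(2) 801 + 352</td></tr>'
      . '<tr><td>2</td><td>Dummy2</td><td>1153</td><td>(2) 801 + 352</td></tr></table>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$table = $dom->getElementsByTagName('table')->item(0);

trimTable($table, 2, 3); // header + 1 participant, first 3 columns
echo $dom->saveHTML($table);
```

Because the header row counts as a row here, a top 10 would need a row limit of 11.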