rongcon, 4 years ago
PHP Question

Get only the data from an HTML table using preg_match_all in PHP

I have an HTML table like this:

<table ... >

<tbody ... >

<tr ... >
<td ...>
string...
</td>
<td ...>
string...
</td>
<td ...>
string...
</td>
<td ...>
string...
</td>
<td ...>
string...
</td>
</tr>
<tr ... >
<td ...>
string...
</td>
<td ...>
string...
</td>
<td ...>
string...
</td>
<td ...>
</td>
<td ...>
string...
</td>
</tr>
..............

</tbody>


</table>


This is a data table and I need to get all the data from it.
The table will have many rows (<tr></tr>). Each row will have a fixed number of columns (<td></td>), currently 5.
Remember that each table, tr, and td tag may carry formatting attributes (shown as "...").

I hope someone can help me write a regex for the preg_match_all function to get the data like this:

array(
    0 => array(
        0 => 'some data0',
        1 => 'some data1',
        2 => 'some data2',
        3 => 'some data3',
        4 => 'some data4',
    ),
    1 => array(
        0 => 'some data0',
        1 => 'some data1',
        2 => 'some data2',
        3 => 'some data3',
        4 => 'some data4',
    ),
    2 => array(
        0 => 'some data0',
        1 => 'some data1',
        2 => 'some data2',
        3 => 'some data3',
        4 => 'some data4',
    ),
    ..........
)


Here is an example to test with; hopefully you can help me!

<table border="1" >
<tbody style="" >

<tr style="" >
<td style="color:blue;">
data0
</td>
<td style="font-size:15px;">
data1
</td>
<td style="font-size:15px;">
data2
</td>
<td style="color:blue;">
data3
</td>
<td style="color:blue;">
data4
</td>
</tr>
<tr style="" >
<td style="color:blue;">
data00
</td>
<td style="font-size:15px;">
data11
</td>
<td style="font-size:15px;">
data22
</td>
<td style="color:blue;">
data33
</td>
<td style="color:blue;">
data44
</td>
</tr>
<tr style="color:black" >
<td style="color:blue;">
data000
</td>
<td style="font-size:15px;">
data111
</td>
<td style="font-size:15px;">
data222
</td>
<td style="color:blue;">
data333
</td>
<td style="color:blue;">
data444
</td>
</tr>

</tbody>


</table>

Answer Source

You absolutely do NOT want to parse HTML with Regex.

There are far too many variations, for one, and more importantly, regex isn't very good with the hierarchical nature of HTML. It's best to use an XML parser or, better yet, an HTML-specific parser.
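To make the nesting problem concrete, here is a small sketch of how a typical lazy pattern goes wrong. The markup and the pattern are made up for illustration (they are not from the question): a cell that itself contains a table gets cut off at the first closing tag the regex finds.

```php
<?php
// Hypothetical markup: the outer cell contains a nested table.
$html = '<table><tr><td>outer <table><tr><td>inner</td></tr></table> tail</td></tr></table>';

// A naive "everything between <td> and </td>" pattern (illustrative only).
preg_match_all('~<td[^>]*>(.*?)</td>~s', $html, $m);

// The lazy match stops at the FIRST </td>, which belongs to the inner cell,
// so the outer cell is truncated and the trailing " tail" text is lost.
var_dump($m[1]);
```

The pattern returns a single, broken match instead of the two cells a parser would see. It can be patched for this one case, but every new variation in the markup needs another patch.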

Whenever I need to scrape HTML, I tend to use the Simple HTML DOM Parser library, which parses an HTML document into a traversable PHP object that you can query much like you would with jQuery.

<?php
    require 'simplehtmldom/simple_html_dom.php';

    $sHtml = <<<EOS
    <table border="1" >
      <tbody style="" >
           <tr style="" > 
                 <td style="color:blue;">
                      data0
                  </td>
                    <td style="font-size:15px;">
                     data1
                  </td>
                    <td style="font-size:15px;">
                      data2
                  </td>
                    <td style="color:blue;">
                      data3
                  </td>
                    <td style="color:blue;">
                      data4
                  </td>
           </tr>
           <tr style="" > 
                 <td style="color:blue;">
                      data00
                  </td>
                    <td style="font-size:15px;">
                     data11
                  </td>
                    <td style="font-size:15px;">
                      data22
                  </td>
                    <td style="color:blue;">
                      data33
                  </td>
                    <td style="color:blue;">
                      data44
                  </td>
           </tr>
           <tr style="color:black" > 
                 <td style="color:blue;">
                      data000
                  </td>
                    <td style="font-size:15px;">
                     data111
                  </td>
                    <td style="font-size:15px;">
                      data222
                  </td>
                    <td style="color:blue;">
                      data333
                  </td>
                    <td style="color:blue;">
                      data444
                  </td>
           </tr>
      </tbody>
    </table>
EOS;

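    // Parse the markup into a traversable object tree.
    $oHTML = str_get_html($sHtml);
    // Select every row inside the table, using a CSS-like selector.
    $oTRs = $oHTML->find('table tr');
    $aData = array();
    foreach($oTRs as $oTR) {
        $aRow = array();
        $oTDs = $oTR->find('td');

        foreach($oTDs as $oTD) {
            // plaintext is the cell's text without tags; trim() drops the surrounding whitespace.
            $aRow[] = trim($oTD->plaintext);
        }

        $aData[] = $aRow;
    }

    var_dump($aData);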
?>

And the output:

array
  0 => 
    array
      0 => string 'data0' (length=5)
      1 => string 'data1' (length=5)
      2 => string 'data2' (length=5)
      3 => string 'data3' (length=5)
      4 => string 'data4' (length=5)
  1 => 
    array
      0 => string 'data00' (length=6)
      1 => string 'data11' (length=6)
      2 => string 'data22' (length=6)
      3 => string 'data33' (length=6)
      4 => string 'data44' (length=6)
  2 => 
    array
      0 => string 'data000' (length=7)
      1 => string 'data111' (length=7)
      2 => string 'data222' (length=7)
      3 => string 'data333' (length=7)
      4 => string 'data444' (length=7)
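
If pulling in a third-party library is not an option, roughly the same extraction can be sketched with PHP's bundled DOM extension (DOMDocument). This is only a minimal sketch: the $sHtml string below is a shortened stand-in for the table in the question, and it assumes the table has no nested tables, since getElementsByTagName also returns descendants of nested elements.

```php
<?php
// Shortened stand-in for the question's table (assumed for illustration).
$sHtml = '<table border="1"><tbody>'
       . '<tr><td style="color:blue;"> data0 </td><td> data1 </td></tr>'
       . '<tr style="color:black"><td> data00 </td><td></td></tr>'
       . '</tbody></table>';

// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$oDoc = new DOMDocument();
$oDoc->loadHTML($sHtml);
libxml_clear_errors();

$aData = array();
foreach ($oDoc->getElementsByTagName('tr') as $oTR) {
    $aRow = array();
    foreach ($oTR->getElementsByTagName('td') as $oTD) {
        // textContent is the cell's text with all tags stripped.
        $aRow[] = trim($oTD->textContent);
    }
    $aData[] = $aRow;
}

var_dump($aData);
```

Empty cells come back as empty strings, so every row keeps the same number of columns.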