Noman Ali - 1 year ago
PHP Question

PHP Regex to find a substring from a big string - Matching start and end

I want to extract page titles from a huge haystack of HTML. The links have no class or unique id, so I can't use a DOM parser here, and I believe I have to use regular expressions.
Here is an example of what I am trying to find:

<a href="">
Series Hell In Heaven information
<a href="">
Series What is going information

The output should be an array with:

[0] => Series Hell In Heaven information
[1] => Series What is going information

All series titles start with Series and end with information. From a huge string containing many other things, I only want to extract the titles.
Currently I am trying to use a regex, but it's not working. Here's what I am doing right now:

$reg = "/^Series\..*information$/";
$str = $html;
preg_match_all($reg, $str, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";

I don't know much about writing regular expressions. Help would be appreciated. Thanks.

Answer


 preg_match_all('/(Series.+?information)/', $str, $matches );


As I said in the comments, remove the literal dot \. and the start and end anchors ^ and $. I would also use a non-greedy "one or more of any character" match, .+?
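To see why the pattern from the question finds nothing, here is a minimal sketch that runs it next to the corrected one. The $html sample is an assumption modelled on the question's example; the real haystack is presumably much larger.

```php
<?php
// Sample input modelled on the question (hypothetical, not the asker's real data).
$html = <<<HTML
<a href="">
Series Hell In Heaven information
<a href="">
Series What is going information
HTML;

// Pattern from the question: without the /m flag, ^ and $ anchor to the
// whole string, and \. demands a literal dot right after "Series".
$originalCount = preg_match_all('/^Series\..*information$/', $html, $originalMatches);

// Corrected pattern: no anchors, no literal dot, non-greedy .+?
$fixedCount = preg_match_all('/(Series.+?information)/', $html, $fixedMatches);

echo "original: $originalCount, fixed: $fixedCount\n"; // original: 0, fixed: 2
print_r($fixedMatches[1]);
```

The titles end up in $fixedMatches[1], the array for the first capture group; $fixedMatches[0] holds the full matches, which are identical here.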

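Otherwise a greedy .+ could run from the first Series to the last information on the same line and return everything in between as a single match.

If the casing of Series or information may vary, such as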

Series .... Information

Add the /i flag as in

     preg_match_all('/(Series.+?information)/i', $str, $matches );
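A quick way to check the effect of the flag is to run both variants against a title with a capital I. The sample string is made up for illustration.

```php
<?php
// Hypothetical title whose last word is capitalised.
$str = 'Series Hell In Heaven Information';

// Case-sensitive: "information" never appears in lowercase, so no match.
$sensitiveCount = preg_match_all('/(Series.+?information)/', $str, $sensitive);

// Case-insensitive: /i lets "Information" satisfy "information".
$insensitiveCount = preg_match_all('/(Series.+?information)/i', $str, $insensitive);

echo "$sensitiveCount vs $insensitiveCount\n"; // 0 vs 1
```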

The outer capture group isn't really needed, but I think it looks nicer with it in there. If you just want the variable content without Series or information, move the capture group ( ) to that part:

 preg_match_all('/Series(.+?)information/i', $str, $matches );

Note that you'll want to trim() the match, because it will likely have spaces at the beginning and end. Alternatively, add the spaces to the regex like this:

 preg_match_all('/Series\s(.+?)\sinformation/i', $str, $matches );

But that will no longer match Series information, where only a single space separates the two words.
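The trade-off between the two variants can be sketched as follows. The input string is an assumption: one ordinary title plus the single-space edge case.

```php
<?php
// Hypothetical input: a normal title and the "Series information" edge case.
$str = "Series Hell In Heaven information\nSeries information\n";

// Variant 1: capture the middle part, then trim the surrounding spaces.
preg_match_all('/Series(.+?)information/i', $str, $loose);
$trimmed = array_map('trim', $loose[1]);
print_r($trimmed); // ["Hell In Heaven", ""]

// Variant 2: put the spaces in the pattern. The edge case needs
// whitespace, at least one character, and whitespace again, so it is skipped.
preg_match_all('/Series\s(.+?)\sinformation/i', $str, $strict);
print_r($strict[1]); // ["Hell In Heaven"]
```

Which one fits depends on whether an empty title should show up in the result or be dropped.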

If you want to be sure you don't match past an information, as in

[Series Hell In Heaven information Series Hell In Heaven information]

where one match would swallow all of that, you can use a positive lookbehind:

preg_match_all('/(Series.+?(?<=information))/i', $str, $matches );
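As a rough check, the sketch below runs a greedy pattern and the lookbehind pattern against that bracketed string. The greedy pattern is added here only for contrast and is not part of the answer's suggestions.

```php
<?php
$str = '[Series Hell In Heaven information Series Hell In Heaven information]';

// Greedy .+ (for contrast): runs to the LAST "information", one big match.
$greedyCount = preg_match_all('/(Series.+information)/i', $str, $greedy);

// Lazy .+? with a lookbehind: stops as soon as the text just consumed
// ends in "information", so each title is matched separately.
$lookbehindCount = preg_match_all('/(Series.+?(?<=information))/i', $str, $lookbehind);

echo "$greedyCount vs $lookbehindCount\n"; // 1 vs 2
print_r($lookbehind[1]);
```

On this input the plain non-greedy /(Series.+?information)/i returns the same two matches, since .+? already stops at the first information.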

Conversely, if there is a possibility that a title will contain the word information twice, as in

   <a href="">
        Series information is power information

You can do this

    preg_match_all('/(Series[^<]+)</i', $str, $matches );

Which will match up to the <, as in </a. The captured text includes any whitespace before the <, so trim() it.
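Here is a small sketch of that case. It assumes the link is closed with </a>, which the question's sample does not show.

```php
<?php
// Hypothetical snippet; the closing </a> is an assumption.
$str = "<a href=\"\">\n    Series information is power information\n</a>";

// Non-greedy pattern: stops at the FIRST "information" and cuts the title short.
preg_match_all('/(Series.+?information)/i', $str, $short);
echo $short[1][0], "\n"; // Series information

// [^<]+ takes everything up to the next "<", including the trailing newline.
preg_match_all('/(Series[^<]+)</i', $str, $full);
$title = trim($full[1][0]);
echo $title, "\n"; // Series information is power information
```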

As a side note, you could use the phpQuery library (which is a DOM parser) and look for an a tag that contains those words.


Using something like

  $tags = $doc->find("a:contains('Series')")->text();

This is an excellent library for parsing HTML.
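phpQuery is a third-party library, so as a dependency-free illustration of the same idea, the sketch below swaps in PHP's bundled DOMDocument and DOMXPath classes instead. The sample markup, including the unrelated Contact link, is made up.

```php
<?php
// Hypothetical markup: two series links and one unrelated link.
$html = '<div><a href="">Series Hell In Heaven information</a>'
      . '<a href="">Contact</a>'
      . '<a href="">Series What is going information</a></div>';

libxml_use_internal_errors(true); // keep warnings about sloppy HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <a> whose text contains "Series" (XPath contains() is case-sensitive).
$xpath = new DOMXPath($doc);
$titles = [];
foreach ($xpath->query("//a[contains(., 'Series')]") as $a) {
    $titles[] = trim($a->textContent);
}
print_r($titles);
```

This only needs the a tags themselves, not a class or id on them, so it may be workable even for the markup described in the question.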
