Noman Ali - 4 months ago
PHP Question

PHP Regex to find a substring from a big string - Matching start and end

I want to find the titles of pages in a huge haystack, but they do not have any class or unique id, so I can't use a DOM parser here. As far as I am aware, I must use regular expressions.
Here is an example of what I am trying to find:

<a href="http://example.com/xyz">
Series Hell In Heaven information
</a>
<a href="http://example.com/123">
Series What is going information
</a>


The output should be an array with:

[0] => Series Hell In Heaven information
[1] => Series What is going information


All series titles start with Series and end with information. From a huge string containing many other things, I only want to extract the titles.
Currently I am trying to use a regex, but it's not working. Here's what I am doing right now:

$reg = "/^Series\..*information$/";
$str = $html;
preg_match_all($reg, $str, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";


I don't know much about writing regular expressions. Help would be appreciated. Thanks.

Answer

Try

 preg_match_all('/(Series.+?information)/', $str, $matches );

As shown here:

https://regex101.com/r/oJ0jZ4/1

As I said in the comments, remove the literal dot \. and the start and end anchors (^ and $). I would also use a non-greedy quantifier that requires at least one character: .+?
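To make the difference concrete, here is a minimal sketch that runs both the pattern from the question and the suggested one against the sample markup. The $html contents are copied from the question; the variable names are only illustrative.

```php
<?php
// Sample haystack copied from the question.
$html = <<<HTML
<a href="http://example.com/xyz">
Series Hell In Heaven information
</a>
<a href="http://example.com/123">
Series What is going information
</a>
HTML;

// Pattern from the question: ^ and $ anchor to the start and end of the
// whole string, and \. demands a literal dot right after "Series".
$originalCount = preg_match_all('/^Series\..*information$/', $html, $originalMatches);

// Suggested pattern: no anchors, no literal dot, lazy .+? so each match
// stops at the first "information" it reaches.
$fixedCount = preg_match_all('/Series.+?information/', $html, $fixedMatches);

echo "original: $originalCount, fixed: $fixedCount\n";

// $fixedMatches[0] holds the full-pattern matches.
print_r($fixedMatches[0]);
```

The original pattern finds nothing because the haystack starts with <a rather than Series, and no title has a dot after Series.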

Otherwise (with .*? instead of .+?) you could match this:

Seriesinformation

If the casing of Series or information may vary, such as

Series .... Information

Add the /i flag as in

     preg_match_all('/(Series.+?information)/i', $str, $matches );
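A quick way to check the effect of the flag, using a made-up input line with a capital I in Information (the input is hypothetical, not from the question):

```php
<?php
// Hypothetical input with mixed casing.
$str = "Series What is going Information";

// Without /i the lowercase "information" in the pattern does not match.
$caseSensitive = preg_match_all('/(Series.+?information)/', $str, $strict);

// With /i the comparison ignores case.
$caseInsensitive = preg_match_all('/(Series.+?information)/i', $str, $loose);

echo $caseSensitive, "\n";   // number of matches without /i
echo $caseInsensitive, "\n"; // number of matches with /i
print_r($loose[1]);          // contents of capture group 1
```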

The outer capture group isn't really needed, but I think it looks nicer with it in there. If you just want the variable content without Series or information, move the capture group ( ) to that part:

 preg_match_all('/Series(.+?)information/i', $str, $matches );

Note that you'll want to trim() the match, because it will likely have spaces at the beginning and end. Alternatively, add the whitespace to the regex like this:

 preg_match_all('/Series\s(.+?)\sinformation/i', $str, $matches );

But that will fail to match Series information with only one space between the two words.
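Here is a small sketch of both options side by side, assuming you only want the middle part of the title. The sample markup is taken from the question; $names and the other variable names are just illustrative.

```php
<?php
$html = "<a href=\"http://example.com/xyz\">\nSeries Hell In Heaven information\n</a>";

// Option 1: capture everything between the two words, then trim() it.
preg_match_all('/Series(.+?)information/i', $html, $matches);
$names = array_map('trim', $matches[1]); // $matches[1] is capture group 1

// Option 2: consume the surrounding whitespace inside the pattern.
preg_match_all('/Series\s(.+?)\sinformation/i', $html, $strictMatches);

print_r($names);
print_r($strictMatches[1]);

// The trade-off: the \s version needs whitespace, at least one character,
// and whitespace again, so a bare "Series information" no longer matches.
$loose  = preg_match('/Series(.+?)information/i', 'Series information');
$strict = preg_match('/Series\s(.+?)\sinformation/i', 'Series information');
echo "loose: $loose, strict: $strict\n";
```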

If you want to be sure a single match doesn't run past the first information, for example in

[Series Hell In Heaven information Series Hell In Heaven information]

where one match could otherwise cover all of that, you can use a positive lookbehind:

preg_match_all('/(Series.+?(?<=information))/i', $str, $matches );
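To see how the greedy and non-greedy forms behave on that string, here is a quick sketch. The $line variable is just an illustrative name, and the greedy pattern is included only for comparison.

```php
<?php
$line = '[Series Hell In Heaven information Series Hell In Heaven information]';

// Greedy .+ runs to the last "information": one long match.
$greedyCount = preg_match_all('/Series.+information/i', $line, $greedy);

// Lazy .+? stops at the first "information": two separate matches.
$lazyCount = preg_match_all('/Series.+?information/i', $line, $lazy);

// The lookbehind variant also ends each match right after the first "information".
$lookbehindCount = preg_match_all('/(Series.+?(?<=information))/i', $line, $lookbehind);

print_r($greedy[0]);
print_r($lazy[0]);
print_r($lookbehind[1]);
```

On this input the lazy pattern and the lookbehind pattern return the same two titles, so the lookbehind is mostly a matter of taste here.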

Conversely, if there is a possibility that a title itself contains the word information twice, as in

   <a href="http://example.com/123">
        Series information is power information
   </a>

You can do this

    preg_match_all('/(Series[^<]+)</i', $str, $matches );

This will match everything up to the <, as in </a>.
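A short sketch of that case, comparing the lazy pattern with the [^<]+ one on the sample above. Note that [^<] also matches newlines, so the capture picks up the trailing line break and is worth passing through trim(). Variable names are illustrative.

```php
<?php
$html = "<a href=\"http://example.com/123\">\n     Series information is power information\n</a>";

// The lazy pattern stops at the first "information" and cuts the title short.
preg_match_all('/Series.+?information/i', $html, $lazy);

// [^<]+ keeps going until the next tag starts, so the whole title is captured.
preg_match_all('/(Series[^<]+)</i', $html, $upToTag);
$titles = array_map('trim', $upToTag[1]); // drop the trailing newline

print_r($lazy[0]);
print_r($titles);
```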

As a side note, you could use the phpQuery library (which is a DOM parser) and look for an a tag that contains those words.

https://github.com/punkave/phpQuery

And

https://code.google.com/archive/p/phpquery/wikis/Manual.wiki

Using something like

  $text = $doc->find("a:contains('Series')")->text();

This is an excellent library for parsing HTML.
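If adding a third-party library is not an option, roughly the same idea can be sketched with PHP's built-in DOMDocument and DOMXPath classes instead of phpQuery. This is a different technique from the one above, it assumes the dom extension is enabled, and the extra About us link is a made-up element added only to show that non-matching links are skipped.

```php
<?php
$html = <<<HTML
<html><body>
<a href="http://example.com/xyz">
Series Hell In Heaven information
</a>
<a href="http://example.com/about">About us</a>
<a href="http://example.com/123">
Series What is going information
</a>
</body></html>
HTML;

// Collect parser warnings internally instead of printing them,
// since real-world HTML is rarely valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// normalize-space() strips the leading newline before starts-with() compares.
$nodes = $xpath->query("//a[starts-with(normalize-space(.), 'Series')]");

$domTitles = [];
foreach ($nodes as $node) {
    $domTitles[] = trim($node->textContent);
}

print_r($domTitles);
```

The links do not need a class or id for this to work, because the XPath expression filters on the link text itself.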
