I want to find the titles of pages in a huge haystack, but they do not have any class or unique id, so I can't use a DOM parser here. I am aware I must use regular expressions.
Here is an example of what I am trying to find:
Series Hell In Heaven information
Series What is going information
 => Series Hell In Heaven information
 => Series What is going information
$reg = "/^Series\..*information$/";
$str = $html;
preg_match_all($reg, $str, $matches);
preg_match_all('/(Series.+?information)/', $str, $matches );
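As I said in the comments, remove the literal dot (\.) and the start and end anchors (^ and $). I would also use a non-greedy quantifier that requires at least one character (.+?).
Otherwise a greedy quantifier could match everything from the first Series to the last information on the same line.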
If the casing of Series or information may vary, such as
Series .... Information
add the /i flag, as in
preg_match_all('/(Series.+?information)/i', $str, $matches );
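To make the difference concrete, here is a minimal sketch comparing the original pattern with the corrected ones. The sample $html is a made-up stand-in for the real page source (the a tags and the lowercase variant are assumptions for illustration only).

```php
<?php
// Hypothetical sample input; the real $html would be the full page source.
$html = "<a href=\"#\">Series Hell In Heaven information</a>\n"
      . "<a href=\"#\">series What is going Information</a>";

// Original attempt: anchored to the start and end of the whole string,
// and it requires a literal dot right after "Series".
$original = preg_match_all('/^Series\..*information$/', $html, $m0);

// Anchors and literal dot removed, non-greedy quantifier, still case-sensitive.
$caseSensitive = preg_match_all('/(Series.+?information)/', $html, $m1);

// Same pattern with the /i flag, so "series" and "Information" also match.
$caseInsensitive = preg_match_all('/(Series.+?information)/i', $html, $m2);

echo "original: $original, case-sensitive: $caseSensitive, /i: $caseInsensitive\n";
print_r($m2[1]);
```

On this input the original pattern finds nothing, the case-sensitive pattern finds only the first title, and the /i version finds both.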
The outer capture group isn't really needed, but I think it looks nicer with it in there. If you just want the variable content without Series or information, move the capture group ( ) to that part.
preg_match_all('/Series(.+?)information/i', $str, $matches );
Note that you'll want to trim() the match, because it will likely have spaces at the beginning and end, or add the spaces to the regex like this:
preg_match_all('/Series\s(.+?)\sinformation/i', $str, $matches );
But that will fail to match Series information, where only a single space separates the two words.
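The trade-off between the two variants can be sketched as follows. The input string is a hypothetical example that contains one normal title and one bare Series information.

```php
<?php
// Hypothetical sample input with a normal title and a bare "Series information".
$str = 'Series Hell In Heaven information, Series information';

// Capture only the middle part, then trim the surrounding spaces afterwards.
preg_match_all('/Series(.+?)information/i', $str, $m);
$trimmed = array_map('trim', $m[1]);

// Whitespace moved into the pattern: no trim() needed, but at least one
// non-space character plus a space on each side is now required.
preg_match_all('/Series\s(.+?)\sinformation/i', $str, $n);

print_r($trimmed); // two entries, the second is an empty string
print_r($n[1]);    // one entry, the bare "Series information" is skipped
```

Which one fits better depends on whether an empty title should count as a match.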
If you want to be sure the match doesn't run past the first information, as in
[Series Hell In Heaven information Series Hell In Heaven information]
where you would otherwise match all of that, you can use a positive lookbehind:
preg_match_all('/(Series.+?(?<=information))/i', $str, $matches );
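For comparison, here is a small sketch that runs a greedy pattern, the plain non-greedy pattern, and the lookbehind variant against the bracketed example above. The greedy pattern is added here only for contrast and is not part of the suggested solution.

```php
<?php
$str = '[Series Hell In Heaven information Series Hell In Heaven information]';

// Greedy: runs on to the last "information" and swallows both titles.
preg_match_all('/(Series.+information)/i', $str, $greedy);

// Non-greedy: stops at the first "information" after each "Series".
preg_match_all('/(Series.+?information)/i', $str, $lazy);

// Non-greedy with a positive lookbehind: stops at the first position
// that is immediately preceded by "information".
preg_match_all('/(Series.+?(?<=information))/i', $str, $lookbehind);

echo count($greedy[1]), ' ', count($lazy[1]), ' ', count($lookbehind[1]), "\n";
```

On this particular input the non-greedy and lookbehind patterns return the same two matches, so the lookbehind mainly makes the stopping condition explicit.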
Conversely, if the title itself may contain the word information twice, as in
<a href="http://example.com/123"> Series information is power information </a>
you can do this:
preg_match_all('/(Series[^<]+)</i', $str, $matches );
which matches everything up to the next <.
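A quick sketch of that pattern applied to the anchor shown above, next to the non-greedy pattern for contrast. The trim() call is an addition here, since the captured text keeps the space before the closing tag.

```php
<?php
$str = '<a href="http://example.com/123"> Series information is power information </a>';

// Non-greedy pattern: stops at the first "information", cutting the title short.
preg_match_all('/(Series.+?information)/i', $str, $short);

// Negated character class: take everything from "Series" up to the next "<".
preg_match_all('/(Series[^<]+)</i', $str, $full);
$title = trim($full[1][0]);

echo $short[1][0], "\n"; // Series information
echo $title, "\n";       // Series information is power information
```

This relies on the title being followed directly by a tag, so it assumes the title text itself contains no < character.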
As a side note, you could use the PHPQuery library (which is a DOM parser) and look for an a tag that contains those words, using something like
$tags = $doc->find("a:contains('Series')")->text();
This is an excellent library for parsing HTML.
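If installing a library is not an option, roughly the same idea can be sketched with PHP's built-in DOMDocument and DOMXPath classes instead of PHPQuery. This is a substitute technique, not the PHPQuery API, and the sample markup below is made up for illustration.

```php
<?php
// Hypothetical sample markup; the real input would be the full page source.
$html = '<p><a href="http://example.com/123"> Series Hell In Heaven information </a>'
      . '<a href="http://example.com/456">Something else</a></p>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them,
// since real-world HTML is rarely valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Select every a element whose text contains "Series" (case-sensitive).
$xpath = new DOMXPath($doc);
$titles = [];
foreach ($xpath->query("//a[contains(., 'Series')]") as $a) {
    $titles[] = trim($a->textContent);
}

print_r($titles);
```

Note that the XPath contains() function is case-sensitive, so unlike the /i regex this would miss a lowercase series.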