LanderTaker - 2 months ago
PHP Question

Preg_match_all not giving the same results as preg_match

I have been trying to extract all file resources referenced in an HTML document.

My current version of the regex is:

"[^']*'([^"]*)'[^']*" | "([^"]*)"


An example HTML (only a part):

<div style="background-image: url('/courses/UMASGRUPOBDEMO/document/learning_path/El_Contrato_de_Seguro-_Contenido_Teorico/video_pres_cto_seguro.jpg');display: block; margin-left: auto; margin-right: auto;"></div>

<img class="maximize"
src="/courses/CURSODESTINOPEQUENO/document/learning_path/LECCION_1_2_3_4_5_-_corta/Diapositiva01-29332.jpg" style="display: block; margin-left: auto; margin-right: auto;" />


By calling preg_match repeatedly, I can get:


  • /courses/UMASGRUPOBDEMO/document/learning_path/El_Contrato_de_Seguro-_Contenido_Teorico/video_pres_cto_seguro.jpg

  • maximize

  • /courses/CURSODESTINOPEQUENO/document/learning_path/LECCION_1_2_3_4_5_-_corta/Diapositiva01-29332.jpg



But preg_match_all only gives me this one:


  • /courses/UMASGRUPOBDEMO/document/learning_path/El_Contrato_de_Seguro-_Contenido_Teorico/video_pres_cto_seguro.jpg




You can live test it at http://www.phpliveregex.com/p/h6T


Does this make any sense? My regex is probably missing something.

I don't have much experience with regex. Please help me :)

Thank you in advance!

Added:

The regex actually is something like:


  • any string delimited by double quotes whose content has no other double quotes and includes a pair of single quotes with optional content between them

  • OR two double quotes with optional content inside (without double quotes)



From what I can see, the "no single quotes" and "no double quotes" conditions may need some adjusting to get a better regex...

Now using a longer HTML example: http://www.phpliveregex.com/p/h74

<p><img class="maximize" src="/courses/UMASGRUPOBDEMO/document/learning_path/Diapositiva54/Diapositiva2.jpg" style="display: block; margin-left: auto; margin-right: auto;" alt="" /></p>

<div style="background-image: url('/courses/UMASGRUPOBDEMO/document/learning_path/El_Contrato_de_Seguro-_Contenido_Teorico/video_pres_cto_seguro.jpg');display: block; margin-left: auto; margin-right: auto;"></div>

<img class="maximize"
src="/courses/CURSODESTINOPEQUENO/document/learning_path/LECCION_1_2_3_4_5_-_corta/Diapositiva01-29332.jpg" style="display: block; margin-left: auto; margin-right: auto;" />

Answer

Try this regex instead:

"[^"']*'([^"']*)'[^"']*"|"([^"]*)"

Your original regex was greedily picking up everything from after the second ' to the last " in the input.
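To see that behaviour for yourself, here is a minimal sketch that runs the first alternative of your original pattern through preg_match_all. The HTML is a shortened stand-in for yours (the paths are made up), and the ~ delimiters and variable names are just my choices for the example.

```php
<?php
// Shortened stand-in for the HTML in the question (paths are made up).
$html = <<<'HTML'
<div style="background-image: url('/courses/a/video.jpg');display: block;"></div>
<img class="maximize"
src="/courses/b/slide01.jpg" />
HTML;

// First alternative of the original pattern, wrapped in ~ delimiters.
// A nowdoc is used so neither kind of quote needs escaping.
$original = <<<'REGEX'
~"[^']*'([^"]*)'[^']*"~
REGEX;

$count = preg_match_all($original, $html, $matches);
$full  = $matches[0][0];

echo $count, PHP_EOL;          // number of matches found
echo $matches[1][0], PHP_EOL;  // captured group of the first match
echo $full, PHP_EOL;           // everything that first match consumed
```

With this sample there is a single match. It starts at the opening quote of the style attribute and runs through the closing quote of the src value, so class="maximize" and the src path are swallowed inside it and can no longer be matched on their own.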

Remember that the * and + operators in regex are greedy, meaning they consume as much as possible while still allowing the overall pattern to match.

You must either limit what those operators are applied to (as I did above) or, in regex engines that support it, turn them into non-greedy operators by using *? or +?:
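Restricting the character classes, as in the pattern suggested at the top of this answer, keeps each match inside one pair of double quotes. A rough sketch of how that could look with preg_match_all follows; the sample HTML is a shortened stand-in with made-up paths, and the delimiters and variable names are assumptions of mine. The value ends up in group 1 or group 2 depending on which alternative matched, so the loop checks both.

```php
<?php
// Shortened stand-in for the HTML in the question (paths are made up).
$html = <<<'HTML'
<div style="background-image: url('/courses/a/video.jpg');display: block;"></div>
<img class="maximize"
src="/courses/b/slide01.jpg" />
HTML;

// The suggested pattern, wrapped in ~ delimiters.
// A nowdoc is used so neither kind of quote needs escaping.
$pattern = <<<'REGEX'
~"[^"']*'([^"']*)'[^"']*"|"([^"]*)"~
REGEX;

// PREG_SET_ORDER yields one array per match: [full match, group 1, group 2].
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$values = [];
foreach ($matches as $m) {
    // Group 1 is filled by the first alternative, group 2 by the second.
    // Trailing unmatched groups are omitted, hence the ?? fallback.
    $values[] = ($m[1] !== '') ? $m[1] : ($m[2] ?? '');
}

print_r($values);
```

For this sample the result holds three values: the path inside url(...), maximize, and the src path. Every double-quoted value is matched, so you would still need to filter out non-file values such as maximize afterwards.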

"[^']*?'[^"]*?'[^']*?"

(However, this last one still has issues. For example, with <img src="foo" alt='bar' class="myimage" /> it will grab 'bar' even though it's not part of a "-delimited string.)
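That caveat can be checked with a small sketch like the one below, which runs the non-greedy pattern and the restricted pattern from the top of this answer against the same tag. The ~ delimiters and the variable names are my own assumptions for the example.

```php
<?php
// The problem tag from the caveat above.
$html = <<<'HTML'
<img src="foo" alt='bar' class="myimage" />
HTML;

// Non-greedy variant, wrapped in ~ delimiters.
$lazy = <<<'REGEX'
~"[^']*?'[^"]*?'[^']*?"~
REGEX;

// Restricted variant suggested at the top of the answer.
$restricted = <<<'REGEX'
~"[^"']*'([^"']*)'[^"']*"|"([^"]*)"~
REGEX;

preg_match($lazy, $html, $lazyMatch);
preg_match_all($restricted, $html, $restrictedMatches);

echo $lazyMatch[0], PHP_EOL;       // text consumed by the non-greedy pattern
print_r($restrictedMatches[2]);    // values captured by the second alternative
```

The non-greedy pattern matches "foo" alt='bar' class=" as one piece, crossing from one attribute into the next, whereas the restricted pattern yields foo and myimage and never touches 'bar'.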