Donatas Veikutis - 1 month ago
PHP Question

Regex for HTML attributes needs a fix

I need to fix this regex, which extracts HTML attributes into an array for me using PHP's preg_match_all function:

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?


An example of the attributes:

style="width: 462px;" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......=" data-filename="Screenshot from 2016-02-09 21:54:47.png"


Working example in a fiddle: https://regex101.com/r/QE9XGD/1

Because of the equals sign at the end of the src attribute value, I get the wrong array:

Array
(
[0] => Array
(
[0] => style="width: 462px;"
[1] => src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......=" data-filename="
)

[1] => Array
(
[0] => style
[1] => src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......
)

[2] => Array
(
[0] => width: 462px;
[1] => data-filename=
)

)


The correct array should look like this:

Array
(
[0] => Array
(
[0] => style="width: 462px;"
[1] => src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......="
[2] => data-filename="Screenshot from 2016-02-09 21:54:47.png"
)

[1] => Array
(
[0] => style
[1] => src
[2] => data-filename
)

[2] => Array
(
[0] => width: 462px;
[1] => data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAg4AAALoCAYAAAAQpn2mAAAABHNCSVQICAgIfAhkiAAAABl0RVh0U29mdHdhcmUAZ25vbWUtc2NyZWVuc2hvdO8Dv4AACAASURBVHic7L15fNTVufj/PjOTyWSyTfaEJBD2EJBNQFQEtFVRXMD7VQG1dfu2tLW92t77unaxam+t9nbTXze9tW61Vdqvgre9FXcqUHFBFiUEkX0PgSQkmf1zzu+Pzz6ZhBBwg3l4kZn5fM7yPM8553me85znnAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAYykIEMZCADGchABjKQgQxkIAMZyEAGMpCBDGQgAxnIQAqIiy66SDXM/SW7DyUQgEIBAiFAKTOZQn8p7N/OQhB6PgFCgUI43ull6mmwyhUolFWJMB.......=
[2] => Screenshot from 2016-02-09 21:54:47.png
)

)


How can I fix this regex to get the correct result?

Keep in mind that I use this regex for more than image attribute extraction; it is meant to be a universal regex for all types of HTML tags.

Answer

(\S+?)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

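The change is to make the attribute name group lazy, so it only consumes characters up to the first =. With the greedy (\S+), the name group swallows the whole run of non-whitespace characters (src="data:...=") and backtracks only as far as the last = in it, so the trailing = of the value is taken as the name/value separator.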
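To see the difference between the two patterns side by side, here is a minimal PHP sketch. It assumes a shortened attribute string: the base64 payload is a hypothetical placeholder that, like the one in the question, ends in =. The variable names are only for illustration.

```php
<?php
// Shortened stand-in for the attribute string in the question; the base64
// payload is a hypothetical placeholder that still ends in "=".
$attrs = 'style="width: 462px;" src="data:image/png;base64,iVBORw0KGgo=" '
       . 'data-filename="Screenshot from 2016-02-09 21:54:47.png"';

// Original pattern from the question: greedy (\S+) for the attribute name.
$greedy = '~(\S+)=["\']?((?:.(?!["\']?\s+(?:\S+)=|[>"\']))+.)["\']?~';

// Pattern from this answer: lazy (\S+?) stops the name at the first "=".
$lazy = '~(\S+?)=["\']?((?:.(?!["\']?\s+(?:\S+)=|[>"\']))+.)["\']?~';

// Default PREG_PATTERN_ORDER: [0] full matches, [1] names, [2] values.
preg_match_all($greedy, $attrs, $greedyMatches);
preg_match_all($lazy, $attrs, $lazyMatches);

print_r($greedyMatches[1]); // names as captured by the greedy pattern
print_r($lazyMatches[1]);   // names as captured by the lazy pattern
print_r($lazyMatches[2]);   // values as captured by the lazy pattern
```

On this input the greedy pattern should find only two matches, the second with a name that begins with src="data:, while the lazy pattern should return style, src and data-filename with their full values.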

Working example on regex101

That being said, I'm fairly confident this regex can be reduced.


([^\s=]+)=('?)("?)([^>"']*)\2\3 is probably the best option:

It takes about 2% of the time of the lazy version and handles both single- and double-quoted attributes. The big change here is that the capture groups you want are now the 1st and 4th. As far as I'm aware, this will work on any HTML except input like: tag='"value'

regex101
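As a rough sketch of how the shorter pattern could be used from PHP, the snippet below collects groups 1 and 4 into a name => value array. The helper name extractAttributes and the sample strings are assumptions made for illustration, not part of the original post, and the base64 payload is again a shortened placeholder.

```php
<?php
// Shorter pattern from the answer; group 1 is the name, group 4 the value.
// Groups 2 and 3 only remember which quote character (if any) was opened.
$pattern = '~([^\s=]+)=(\'?)("?)([^>"\']*)\2\3~';

// Hypothetical helper: returns the attributes as name => value pairs.
function extractAttributes(string $pattern, string $attrs): array
{
    $result = [];
    // PREG_SET_ORDER gives one array per match: [0] full, [1]..[4] groups.
    preg_match_all($pattern, $attrs, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $result[$match[1]] = $match[4];
    }
    return $result;
}

// Shortened stand-in for the attribute string in the question.
$attrs = 'style="width: 462px;" src="data:image/png;base64,iVBORw0KGgo=" '
       . 'data-filename="Screenshot from 2016-02-09 21:54:47.png"';

print_r(extractAttributes($pattern, $attrs));

// Single-quoted and double-quoted values in the same string.
print_r(extractAttributes($pattern, "alt='a b' title=\"c\""));

// The exception noted in the answer: the value is not captured here.
print_r(extractAttributes($pattern, "tag='\"value'"));
```

Because the value group is [^>"']*, this sketch assumes quoted values that contain neither quote character nor >. An unquoted run such as a=b c=d would be captured as a single value, so input like that needs separate handling.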