Yahya Uddin - 1 month ago
PHP Question

Using regex to extract tag names and values

I want to be able to extract the tag names and values of queries.

Given the following query:

title:(Harry Potter) abc def author:'John' rating:5 jhi cost:"2.20" lmnop qrs

I want to be able to extract the following information:

title => Harry Potter
author => John
rating => 5
cost => 2.20
rest => abc def jhi lmnop qrs

Note that a tag value can be enclosed in '...', "..." or (...). It doesn't matter which.

This problem was solved using the following:

$query = "..."; // User input

while (preg_match(
)) {
    echo $matches['key'] . " => " . $matches['value'];
    $query = trim(str_replace($matches[0], '', $query));
}

Now this is okay for many cases. However, there are quite a few corner cases:

1) For example consider:

title:(John's) abc

should go to:

title => John's
rest => abc

but instead goes to

title => (John'
rest => s) abc

2) Also consider:

title: (foo (: bar)

should go to:

title => foo (: bar

goes to:

rest => (foo (bar)

How can I do this? Is regex even the best way to go? How else can I solve this issue?

UPDATE: Fixed a mistake in one of the expected outputs.


It's not possible to parse everything exactly with a single regex the way you do, because the same rule does not apply to all of your (key, value) pairs. A closing parenthesis, for instance, would be accepted in the middle of the author value but not in the middle of title. A single quote would be accepted in the middle of title but not in the middle of author, and so on. So even if your rule works in most cases, your second capture group cannot be properly defined.

One way to improve your solution would be to use a different regular expression for each tag. You could then do something like this:

$str   = "title:(foo (: bar) abc def ".
         "author:'John' "             .
         "rating:5 jhi "              .
         "cost:\"2.20\" "             .
         "lmnop qrs ";

$regex = array(
  "title"  => "/(?P<key>title):[[:space:]]*\((?P<value>[^\)]*)\)/"       ,
  "author" => "/(?P<key>author):[[:space:]]*'(?P<value>[^']*)'/"         ,
  "rating" => "/(?P<key>rating):[[:space:]]*(?P<value>[\d]+)/"           ,
  "cost"   => "/(?P<key>cost):[[:space:]]*\"(?P<value>[\d]+\.[\d]{2})\"/"
);

// Try each tag's pattern in turn and report the tags that are missing.
foreach ($regex as $k => $r) {
  if (preg_match($r, $str, $matches)) {
    echo $matches['key'] . " => " . $matches['value'] . "\n";
  } else {
    echo "Nothing found for " . $k . "\n";
  }
}

However, note that this solution is not bulletproof. For example, you'll have a problem if the title of a book contains the string author: 'JOHN'.

In my opinion, the best way to avoid such issues is to define a grammatical rule for your input string and to reject every string that doesn't match your rule. It also depends on your requirements and on your application, I guess.
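
As a rough illustration of the "define a grammar and reject everything else" idea, here is a minimal sketch. The grammar, the function name parse_query() and the returned array shape are all assumptions made for this example, not something given in the question: a query is a sequence of whitespace-separated items, each item is either key:value or a plain word, and a value is (...), '...', "..." or a bare word. The scanner walks the string from left to right and returns null as soon as the input stops following those rules.

```php
<?php
// Hypothetical grammar (an assumption for this sketch):
//   query := item (space item)*
//   item  := key ":" space* value | word
//   value := "(" ... ")" | "'" ... "'" | '"' ... '"' | word
// Returns the tags and the leftover words, or null if the input is rejected.
function parse_query(string $query, array $keys): ?array
{
    // Keys are assumed to be plain words such as "title" or "author".
    $keyAlt = implode('|', array_map('preg_quote', $keys));
    // Branch reset group: the value is always capture group 2.
    $value  = '(?|\(([^)]*)\)|\'([^\']*)\'|"([^"]*)"|([^\s:()\'"]+))';
    $tag    = '/\G(' . $keyAlt . '):\s*' . $value . '(?=\s|$)/';
    $word   = '/\G[^\s:()\'"]+(?=\s|$)/';

    $tags = array();
    $rest = array();
    $pos  = 0;
    $len  = strlen($query);

    while ($pos < $len) {
        // \G anchors each attempt at the current offset.
        if (preg_match('/\G\s+/', $query, $m, 0, $pos)) {
            // Skip separators.
        } elseif (preg_match($tag, $query, $m, 0, $pos)) {
            $tags[$m[1]] = $m[2];   // a repeated key overwrites the earlier value
        } elseif (preg_match($word, $query, $m, 0, $pos)) {
            $rest[] = $m[0];
        } else {
            return null;            // input does not follow the grammar
        }
        $pos += strlen($m[0]);
    }

    return array('tags' => $tags, 'rest' => implode(' ', $rest));
}

$keys = array('title', 'author', 'rating', 'cost');
print_r(parse_query("title:(John's) abc", $keys));
var_dump(parse_query('title:(unclosed abc', $keys));
```

Because each closing delimiter is tied to its own opening delimiter, a quote inside (...) is just part of the value. The price is strictness: anything the grammar does not describe, such as an unclosed parenthesis, is rejected rather than guessed at.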


Note that a tag value can be enclosed in '...', "..." or (...). It doesn't matter which.

In that case, your problem is still that


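is incorrect. Instead, you want each pair of delimiters to match: a value opened with ( must be closed with ), a value opened with ' must be closed with ', and so on. There is a subpattern option for that, the branch reset group (?|...), which lets every alternative reuse the same capture group (see the PHP manual on subpatterns).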
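
To show that subpattern option in isolation, here is a small sketch of a branch reset group, (?|...). The helper name delimited_value() is made up for this example. Each alternative pairs one opening delimiter with its own closing delimiter, and because the group numbering restarts in every alternative, the value always ends up in group 1.

```php
<?php
// Returns the first delimited value found in $input, or null if there is none.
// delimited_value() is a hypothetical helper used only for this illustration.
function delimited_value(string $input): ?string
{
    // (?| ... ) is a branch reset group: "...", '...' and (...) each
    // capture into group 1, whichever alternative matches.
    $pattern = '/(?|"([^"]*)"|\'([^\']*)\'|\(([^)]*)\))/';

    if (preg_match($pattern, $input, $m)) {
        return $m[1];
    }
    return null;
}

foreach (array('"2.20"', "'John'", "(John's)") as $input) {
    echo $input . " => " . delimited_value($input) . "\n";
}
```

Note that the single quote in (John's) is not treated as a delimiter here, because the value was opened with a parenthesis and only a closing parenthesis can end it.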


If you use \ as the escape character, the code becomes

$str   = 'title:"foo \" bar" abc def '.
         'author:(Joh\)n) '           .
         'rating:\'5\\\'4\' jhi '     .
         'cost:"2.20" '               .
         'lmnop qrs ';

$regex = "/(?P<key>title|author|rating|cost):[[:space:]]*" . 
         "(?|" . 
             "\"(?P<value>(?:(?:\\\\\")|[^\"])+)\"" . "|" . // matches "..." 
             "\'(?P<value>(?:(?:\\\\\')|[^\'])+)\'" . "|" . // matches '...'
             "\((?P<value>(?:(?:\\\\\))|[^\)])+)\)" .       // matches (...)
         ")/"; // closes the branch reset group (?|...)

// Print each pair, then remove the matched text so the next tag can be found.
while (preg_match($regex, $str, $matches)) {
  echo $matches['key'] . " => " . $matches['value'] . "\n";
  $str = str_replace($matches[0], '', $str);
}


title => foo \" bar
author => Joh\)n
rating => 5\'4
cost => 2.20
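
The question also asks for the leftover words (the "rest"). One possible way to get both in a single pass, sketched here under a few assumptions, is preg_replace_callback(): the callback records each pair and replaces the matched text with an empty string. The function name extract_tags() and the returned array shape are hypothetical, and the last alternative (\S+) for unquoted values such as rating:5 is an addition that is not part of the pattern above.

```php
<?php
// Collects key/value pairs and returns them together with the remaining words.
// extract_tags() is a hypothetical helper; the key list mirrors the answer above.
function extract_tags(string $query): array
{
    $pattern = '/(?P<key>title|author|rating|cost):\s*'
             . '(?|"((?:\\\\"|[^"])+)"'       // "..." with \" allowed inside
             . '|\'((?:\\\\\'|[^\'])+)\''     // '...' with \' allowed inside
             . '|\(((?:\\\\\)|[^)])+)\)'      // (...) with \) allowed inside
             . '|(\S+))/';                    // bare word (an added assumption)

    $tags = array();
    // Group 1 is the key; the branch reset group always captures into group 2.
    $rest = preg_replace_callback($pattern, function ($m) use (&$tags) {
        $tags[$m['key']] = $m[2];
        return '';
    }, $query);

    // Collapse the gaps left behind by the removed tags.
    $rest = trim(preg_replace('/\s+/', ' ', $rest));

    return array('tags' => $tags, 'rest' => $rest);
}

print_r(extract_tags(
    'title:(Harry Potter) abc def author:\'John\' rating:5 jhi cost:"2.20" lmnop qrs'
));
```

Unlike a stricter parser, this version never rejects input: text that does not look like a tag simply stays in the rest, so malformed tags pass through silently.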