Yahya Uddin - 2 months ago
PHP Question

Using regex to extract tag names and values

I want to be able to extract the tag names and values of queries.

Given the following query:

title:(Harry Potter) abc def author:'John' rating:5 jhi cost:"2.20" lmnop qrs


I want to be able to extract the following information:

title => Harry Potter
author => John
rating => 5
cost => 2.20
rest => abc def jhi lmnop qrs


Note that a tag value can be enclosed in '...', "..." or (...). It doesn't matter which.

I tried to solve this problem with the following:

$query = "..."; // User input

while (preg_match(
'@(?P<key>title|author|rating|cost):(?P<value>[^\'"(\s]+)@',
$query,
$matches
)) {
echo $matches['key'] . " => " . $matches['value'];
$query = trim(str_replace($matches[0], '', $query));
}

while (preg_match(
'@(?P<key>title|author|rating|cost):[\'"(](?P<value>[^\'")]+)[\'")]@',
$query,
$matches
)) {
echo $matches['key'] . " => " . $matches['value'];
$query = trim(str_replace($matches[0], '', $query));
}


Now this is okay for many cases. However, there are quite a few corner cases:

1) For example consider:

title:(John's) abc


should go to:

title => John's
rest => abc


but instead goes to

title => John
rest => s) abc


2) Also consider:

title: (foo (: bar)


should go to:

title => foo (: bar


goes to:

rest => title: (foo (: bar)


How can I do this? Is regex even the best way to go? How else can I solve this issue?

UPDATE Fixed a mistake on one of the expected outputs

Answer

It's not possible to parse everything correctly with a single regex like yours, because the same rule does not apply to all of your (key, value) pairs. A closing parenthesis, for instance, would be accepted in the middle of the author value but not in the middle of title. A single quote would be accepted in the middle of title but not in the middle of author, and so on. So even if your rule works in most cases, your second capture group cannot be properly defined.

One way to improve your solution would be to use a different regular expression for each tag. You could then do something like this:
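
To make the point concrete, here is a small check of the question's second pattern against the first corner case. It is only a sketch, and the variable name $pattern is chosen for illustration. Because any opening delimiter may be closed by any closing delimiter, the "(" ends up paired with the apostrophe.

```php
<?php
// The question's second pattern: one shared character class for all delimiters.
$pattern = '@(?P<key>title|author|rating|cost):[\'"(](?P<value>[^\'")]+)[\'")]@';

preg_match($pattern, "title:(John's) abc", $matches);

// The opening "(" is closed by the apostrophe, so the value is cut short.
echo $matches['key'] . ' => ' . $matches['value'] . "\n"; // title => John
echo 'matched text: ' . $matches[0] . "\n";               // matched text: title:(John'
```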

$str   = "title:(foo (: bar) abc def ".
         "author:'John' "             .
         "rating:5 jhi "              .
         "cost:\"2.20\" "             .
         "lmnop qrs ";


$regex = array(
  "title"  => "/(?P<key>title):[[:space:]]*\((?P<value>[^\)]*)\)/"       ,
  "author" => "/(?P<key>author):[[:space:]]*'(?P<value>[^']*)'/"         ,
  "rating" => "/(?P<key>rating):[[:space:]]*(?P<value>[\d]+)/"           ,
  "cost"   => "/(?P<key>cost):[[:space:]]*\"(?P<value>[\d]+\.[\d]{2})\"/"
  );

foreach($regex as $k => $r)
{
  if(preg_match($r, $str, $matches))
  {
    echo $matches['key'] . " => " . $matches['value'] . "\n";
  }
  else
  {
    echo "Nothing found for " . $k . "\n";
  }
}

However, note that this solution is not bulletproof. For example, you'll have a problem if the title of a book contains the string author: 'JOHN'.

In my opinion, the best way to avoid such issues is to define a grammar for your input string and to reject every string that doesn't match it. It also depends on your requirements and on your application, I guess.
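
A quick way to see that failure is the following sketch. The input string is made up for illustration; the author pattern is the same per-tag pattern as above.

```php
<?php
// Made-up input: the title itself contains text that looks like an author tag.
$str = "title:(Notes on author:'JOHN') author:'Jane'";

// Same per-tag author pattern as in the $regex array above.
$authorRegex = "/(?P<key>author):[[:space:]]*'(?P<value>[^']*)'/";

preg_match($authorRegex, $str, $matches);

// preg_match returns the leftmost match, which sits inside the title.
echo $matches['key'] . " => " . $matches['value'] . "\n"; // author => JOHN
```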
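
As a rough sketch of what such a grammar could look like (this is one possible set of rules, not the only one): a query is a sequence of key:value pairs and free words, a value is (...), '...', "..." or a bare token, and anything that does not fit is rejected. The function name parseQuery, the returned array shape and the exact rules for free words are assumptions made for this example.

```php
<?php
/**
 * Hypothetical helper: parse a query against a small explicit grammar.
 * Returns ['tags' => [...], 'rest' => '...'], or null if the input is rejected.
 */
function parseQuery(string $query): ?array
{
    // One token = a key:value pair or a free word, followed by optional spaces.
    // \G anchors every match at the current offset, so nothing can be skipped.
    $token = '/\G(?:'
        . '(?P<key>title|author|rating|cost):\s*'
        . '(?|\((?P<value>[^)]*)\)|\'(?P<value>[^\']*)\'|"(?P<value>[^"]*)"|(?P<value>[^\s\'"()]+))'
        . '|(?P<word>[^\s:]+)' // assumption: free words may not contain ":"
        . ')\s*/';

    $query  = trim($query);
    $length = strlen($query);
    $offset = 0;
    $tags   = [];
    $rest   = [];

    while ($offset < $length) {
        if (!preg_match($token, $query, $m, 0, $offset)) {
            return null; // the input does not follow the grammar
        }
        if (($m['key'] ?? '') !== '') {
            $tags[$m['key']] = $m['value'];
        } else {
            $rest[] = $m['word'];
        }
        $offset += strlen($m[0]);
    }

    return ['tags' => $tags, 'rest' => implode(' ', $rest)];
}

print_r(parseQuery('title:(Harry Potter) abc def author:\'John\' rating:5 jhi cost:"2.20" lmnop qrs'));
var_dump(parseQuery('title:(unterminated')); // NULL: the closing ")" is missing
```

Rejecting the input as a whole is a design choice. Depending on the application, you may prefer to report the position where parsing stopped instead.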


Edit

Note that a tag value can be enclosed in '...', "..." or (...). It doesn't matter which.

In that case, your problem is still that

[\'\"\(](?P<value>[^\'\"\)]+)[\'\"\)]

is incorrect. Instead, you want each pair of delimiters to match. PCRE provides a subpattern option for that, the branch reset group (?|...):

(?|\'(?P<value>[^\']+)\'|\"(?P<value>[^\"]+)\"|\((?P<value>[^\)]+)\))

If you use \ as the escape character, the code becomes:
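
Here is a minimal sketch of that group on its own, so you can see that each opening delimiter is only closed by its own counterpart. The sample inputs and the names $valueRegex and $results are made up for illustration.

```php
<?php
// Branch reset: every alternative stores its capture in the same group, "value".
$valueRegex = '/(?|\'(?P<value>[^\']+)\'|"(?P<value>[^"]+)"|\((?P<value>[^)]+)\))/';

$results = [];
foreach (["(John's)", "'foo (bar'", '"it\'s (ok)"'] as $input) {
    preg_match($valueRegex, $input, $matches);
    $results[$input] = $matches['value'];
    echo $input . ' => ' . $matches['value'] . "\n";
}
// (John's) => John's
// 'foo (bar' => foo (bar
// "it's (ok)" => it's (ok)
```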

$str   = 'title:"foo \" bar" abc def '.
         'author:(Joh\)n) '           .
         'rating:\'5\\\'4\' jhi '     .
         'cost:"2.20" '               .
         'lmnop qrs ';

$regex = "/(?P<key>title|author|rating|cost):[[:space:]]*" .
         "(?|" .
             "\"(?P<value>(?:(?:\\\\\")|[^\"])+)\"" . "|" . // matches "..."
             "\'(?P<value>(?:(?:\\\\\')|[^\'])+)\'" . "|" . // matches '...'
             "\((?P<value>(?:(?:\\\\\))|[^\)])+)\)" .       // matches (...)
         ")/"; // closes the branch reset group (?|...)


while(preg_match($regex, $str, $matches))
{
  echo $matches['key'] . " => " . $matches['value'] . "\n";
  $str = str_replace($matches[0], '', $str);
}

Output

title => foo \" bar
author => Joh\)n
rating => 5\'4
cost => 2.20
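
The values above still contain the escape backslashes, and the loop does not print the "rest" line from the question. If you need both, something along these lines might do. Note that this is a variation rather than the code above: it uses preg_replace_callback instead of the while/str_replace loop, and a slightly different escape sub-pattern (a backslash followed by any character). The names $tags and $rest are chosen for illustration.

```php
<?php
$str = 'title:"foo \" bar" abc def author:(Joh\)n) rating:\'5\\\'4\' jhi cost:"2.20" lmnop qrs';

// Inside each delimiter pair: either a backslash escape or any character
// that is neither the closing delimiter nor a backslash.
$regex = '/(?P<key>title|author|rating|cost):\s*'
       . '(?|"(?P<value>(?:\\\\.|[^"\\\\])+)"'
       . '|\'(?P<value>(?:\\\\.|[^\'\\\\])+)\''
       . '|\((?P<value>(?:\\\\.|[^)\\\\])+)\))/';

$tags = [];
$rest = preg_replace_callback($regex, function (array $m) use (&$tags) {
    // Drop the backslash in front of each escaped character.
    $tags[$m['key']] = preg_replace('/\\\\(.)/', '$1', $m['value']);
    return ''; // remove the tag from the query
}, $str);

// Collapse the whitespace left behind by the removed tags.
$rest = trim(preg_replace('/\s+/', ' ', $rest));

foreach ($tags as $key => $value) {
    echo $key . ' => ' . $value . "\n";
}
echo 'rest => ' . $rest . "\n";
// title => foo " bar
// author => Joh)n
// rating => 5'4
// cost => 2.20
// rest => abc def jhi lmnop qrs
```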