Juris - 9 days ago
PHP Question

How to parse a string without losing plus sign in PHP?

I am parsing HTML strings in PHP to extract values and write them to a database. Here is an example string:

<b>Adress:</b> 22 Examplary road, Nowhere <br>
<b>Phone:</b> +371 12345678, +371 23456789<br>
<b>E-mail: </b>info@example.com<br>


The formatting of the string varies. It can contain additional keys that I am not parsing out, and it can contain duplicate keys. It can also contain only some of the keys that I am interested in, or be completely empty. The HTML can also be broken (example tag: <br). I have decided to rely on the rules that entries are separated by \n and have the form key: value plus some HTML.

First, I use this code to make the string parseable:

$parse = strip_tags($string);
$parse = str_replace(':', '=', $parse);
$parse = str_replace("\n", '&', $parse);
$parse = str_replace("\r", '', $parse);
$parse = str_replace("\t", '', $parse);


My string looks something like this now:

Adress= 22 Examplary road, Nowhere&Phone= +371 12345678, +371 23456789&E-mail= info@example.com


Then I use parse_str() to get the values, and pick out the ones whose keys I need:

parse_str($parse, $values);

$address = null;
if (isset($values['Adress']))
    $address = trim($values['Adress']);

$phone = null;
if (isset($values['Phone']))
    $phone = trim($values['Phone']);


The problem is that I end up with $phone = '371 12345678, 371 23456789': the + signs are lost. How can I preserve them?
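A note on why this happens: parse_str() treats its input as a URL query string and URL-decodes it, and in that encoding a literal + stands for a space. One possible workaround, if you want to keep parse_str(), is to percent-escape the + (and % itself) before parsing. This is only a minimal sketch; the input string below is a hypothetical, already pre-processed example.

```php
<?php
// Hypothetical input, already run through strip_tags() and the str_replace() calls.
$parse = 'Adress= 22 Examplary road, Nowhere&Phone= +371 12345678, +371 23456789';

// parse_str() URL-decodes its input, so a bare "+" is read as an encoded space.
// Escape "%" first and then "+", so both survive the decoding step unchanged.
$escaped = str_replace(['%', '+'], ['%25', '%2B'], $parse);

parse_str($escaped, $values);

$phone = isset($values['Phone']) ? trim($values['Phone']) : null;
echo $phone, "\n";
```

With this input, $phone should come out as '+371 12345678, +371 23456789'. Values containing & or = would still be split incorrectly, which is one reason a regex-based approach may be more robust.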

Also, if you have any hints on how to improve this procedure, I would be glad to hear them. Some entries have Website: example.com, others have Web Site example.com ... I am pretty sure that it will not be possible to parse all of the information automatically, but I am looking for the best possible solution.

Solution



Using tips provided by WEBjuju I am now using this:

preg_match_all('/([^:]*):\s?(.*)\n/Usi', $string, $matches, PREG_SET_ORDER);

$values = [];
foreach ($matches as $match)
{
    // Normalize the key: drop tags, lowercase, remove whitespace and hyphens.
    $key = strip_tags($match[1]);
    $key = trim($key);
    $key = mb_strtolower($key);
    // str_replace("\s", ...) would only match a literal backslash followed by "s".
    $key = preg_replace('/\s+/', '', $key);
    $key = str_replace('-', '', $key);

    $value = strip_tags($match[2]);
    $value = trim($value);

    $values[$key] = $value;
}


This allows me to go from this input:

<b>Venue:</b> The Hall<br
<b>Adress:</b> 22 Examplary road, Nowhere <br>
<b>Phone:</b> +371 12345678<br>
<b>E-mail: </b>info@example.com<br>
<b>Website:</b> <a href="http://example.com/" target="_blank">example.com</a><br>


To a nice PHP array with homogenized and hopefully recognizable keys:

[
'venue' => 'The Hall',
'adress' => '22 Examplary road, Nowhere',
'phone' => '+371 12345678',
'email' => 'info@example.com',
'website' => 'example.com',
];


It still doesn't account for the cases of missing colons, but I don't think I can solve that...
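For the missing-colon case, one idea that might get part of the way is to test each line against a list of labels you already know about, allowing spaces or hyphens between the letters. The sketch below is only an illustration: matchKnownLabel() and the $knownKeys list are hypothetical names, and it assumes the labels are plain ASCII.

```php
<?php
// Hypothetical helper: returns [normalizedKey, value], or null when the line
// does not start with any of the known labels.
function matchKnownLabel(string $line, array $knownKeys): ?array
{
    $text = trim(strip_tags($line));

    foreach ($knownKeys as $key) {
        // Allow spaces or hyphens between the letters, so that "website"
        // also matches "Web Site" and "web-site". Assumes ASCII-only keys.
        $letters = array_map(function ($c) {
            return preg_quote($c, '/');
        }, str_split($key));
        $pattern = '/^' . implode('[\s-]*', $letters) . '\b[\s:]*(.*)$/is';

        if (preg_match($pattern, $text, $m)) {
            return [$key, trim($m[1])];
        }
    }

    return null;
}

// Hypothetical list of normalized labels of interest.
$knownKeys = ['website', 'phone', 'email'];

$result = matchKnownLabel('<b>Web Site</b> example.com<br>', $knownKeys);
print_r($result);
```

This only helps for labels you can list in advance; unknown labels without a colon would still be skipped.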

Answer

Since your HTML is preformatted and follows a simple, regular structure, regular expression matching is probably the best way to grab this data. Here is an example to get you on your way. I'm sure it doesn't solve everything, but it addresses the issue in this post, namely finding key/value matches.

// now go get those matches!
preg_match_all('/<b>([^:]*):\s?<\/b>(.*)<br>/Usi', $string, $matches, PREG_SET_ORDER);
die('<pre>'.print_r($matches,true));

That will output, for instance, something like this:

Array
(
  [0] => Array
    (
        [0] => <b>Adress:</b> 22 Examplary road, Nowhere <br>
        [1] => Adress
        [2] =>  22 Examplary road, Nowhere
    )

  [1] => Array
    (
        [0] => <b>Phone:</b>  +371 12345678, +371 23456789<br>
        [1] => Phone
        [2] =>   +371 12345678, +371 23456789
    )

  [2] => Array
    (
        [0] => <b>E-mail: </b>info@example.com<br>
        [1] => E-mail
        [2] => info@example.com
    )

)

And from there, I'd have to guess that you can putt that in for par.
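Since the question mentions that keys can be duplicated, here is one way the $matches array could be folded into a key/value structure without overwriting repeats. It is a rough sketch: the sample $string, including the second Phone line, is made up for illustration, and strtolower() is used only because these labels are ASCII (mb_strtolower() would be the safer choice for non-ASCII labels).

```php
<?php
// Hypothetical sample input; the second "Phone" line is an invented duplicate.
$string = "<b>Adress:</b> 22 Examplary road, Nowhere <br>\n"
        . "<b>Phone:</b> +371 12345678, +371 23456789<br>\n"
        . "<b>Phone:</b> +371 34567890<br>\n"
        . "<b>E-mail: </b>info@example.com<br>\n";

preg_match_all('/<b>([^:]*):\s?<\/b>(.*)<br>/Usi', $string, $matches, PREG_SET_ORDER);

$values = [];
foreach ($matches as $match) {
    $key = strtolower(trim($match[1]));
    // Collect every value under its key, so duplicates are kept side by side.
    $values[$key][] = trim($match[2]);
}

print_r($values);
```

Each key then maps to a list of values, so 'phone' holds two entries here while 'adress' and 'e-mail' hold one each.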