mnv mnv - 5 months ago 16
PHP Question

How to cut a string from the start to the second-to-last dot?

I have a string, for example:

cats, e.g. Barsik, are funny. And it is true. So,


And I want to get this as the result:

cats, e.g. Barsik, are funny.


My attempt:

mb_ereg_search_init($text, '((?!e\.g\.).)*\.[^\.]');
$match = mb_ereg_search_pos();


But it returns the position of the second dot (after the word "true").

How can I get the desired result?

Answer

Since a naive approach works for you, I am posting an answer. However, please note that detecting a sentence end is a very difficult task for a regex, and although it is possible to some degree, an NLP package should be used for that.

Having said that, I suggested using

'~(?<!\be\.g)\.(?=\s+\p{Lu})~ui'

The regex matches any dot (\.) that is not preceded by the whole word e.g (see the negative lookbehind (?<!\be\.g)) and that is followed by one or more whitespace characters (\s+) and then one uppercase Unicode letter (\p{Lu}).

See the regex demo

The case insensitive i modifier does not impact what \p{Lu} matches.

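The u modifier (the ~ is just the regex delimiter) is required since you are working with Unicode texts (like Russian): it makes both the pattern and the subject string be treated as UTF-8.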
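To see which dots this pattern accepts, here is a minimal sketch that runs it over the sample string from the question with preg_match_all and collects the offsets. The variable names are only for illustration.

```php
<?php
$re  = '~(?<!\be\.g)\.(?=\s+\p{Lu})~ui';
$str = "cats, e.g. Barsik, are funny. And it is true. So,";

// With PREG_OFFSET_CAPTURE, every entry of $matches[0] is a
// [matched text, byte offset] pair.
preg_match_all($re, $str, $matches, PREG_OFFSET_CAPTURE);

// Keep only the offsets of the accepted dots.
$offsets = array_column($matches[0], 1);

// $offsets is [28, 44]: the dots after "funny" and after "true".
// The two dots inside "e.g." are skipped.
print_r($offsets);
```

The dot after "e" fails the lookahead (it is followed by "g", not whitespace), and the dot after "e.g" fails the negative lookbehind.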

To get the index of the first occurrence, use the preg_match function with the PREG_OFFSET_CAPTURE flag. Here is a slightly simplified version of the regex you supplied in the comments:

preg_match('~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu', $text, $match, PREG_OFFSET_CAPTURE);

Note that the lookbehinds are checked one after another at the same position in the string, so you do not have to additionally group them inside a positive lookahead. See the regex demo.
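Because each negative lookbehind is checked on its own at the same position, the chain can also be generated from a list of abbreviations. The sketch below assumes such a list is available; the function name buildSentenceEndPattern and the list itself are hypothetical and not part of the original regex.

```php
<?php
// Hypothetical helper: emits one negative lookbehind per abbreviation.
function buildSentenceEndPattern(array $abbreviations): string
{
    $lookbehinds = '';
    foreach ($abbreviations as $abbr) {
        // Drop the trailing dot: the pattern's own \. matches that one.
        $stem = rtrim($abbr, '.');
        // Escape regex metacharacters (and the ~ delimiter) in the stem.
        $lookbehinds .= '(?<!' . preg_quote($stem, '~') . ')';
    }
    return '~' . $lookbehinds . '\.(?=\s+\p{L})~iu';
}

// Assumed abbreviation list, taken from the regex above.
$pattern = buildSentenceEndPattern(['т.н.', 'т.к.', 'e.g.']);

echo $pattern;
```

For this list the helper produces the same pattern as the hand-written one, so adding another abbreviation only means adding another array entry.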

IDEONE demo:

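$re = '~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu';
$str = "cats, e.g. Barsik, are funny. And it is true. So,";
// PREG_OFFSET_CAPTURE turns each entry of $match into a [text, byte offset] pair.
preg_match($re, $str, $match, PREG_OFFSET_CAPTURE);
echo $match[0][1]; // 28, the byte offset of the dot after "funny"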
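The offset alone does not yet give the string asked for in the question. A minimal sketch of the last step follows; the helper name firstSentence is hypothetical, and returning the whole text when nothing matches is an assumption about the desired fallback.

```php
<?php
// Hypothetical helper: returns the text up to and including the first
// sentence-ending dot, or the whole text when no such dot is found.
function firstSentence(string $text): string
{
    $re = '~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu';
    // preg_match returns 1 on a match, 0 on no match, false on error.
    if (preg_match($re, $text, $match, PREG_OFFSET_CAPTURE) !== 1) {
        return $text;
    }
    // The captured offset is counted in bytes, so cut with substr(),
    // not mb_substr(); +1 keeps the dot itself.
    return substr($text, 0, $match[0][1] + 1);
}

echo firstSentence("cats, e.g. Barsik, are funny. And it is true. So,");
// prints: cats, e.g. Barsik, are funny.
```

Using substr matters for Russian text: PREG_OFFSET_CAPTURE reports byte offsets even with the u modifier, so a character-based function such as mb_substr would cut at the wrong place.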