Erik Edgren - 2 months ago
PHP Question

Get the most used words with special characters

I want to get the most used word from an array. The only problem is that the Swedish characters (Å, Ä, and Ö) will only show as �.

$string = 'This is just a test post with the Swedish characters Å, Ä, and Ö. Also as lower cased characters: å, ä, and ö.';
echo '<pre>';
print_r(array_count_values(str_word_count($string, 1, 'àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ')));
echo '</pre>';


That code will output the following:

Array
(
[This] => 1
[is] => 1
[just] => 1
[a] => 1
[test] => 1
[post] => 1
[with] => 1
[the] => 1
[Swedish] => 1
[characters] => 2
[�] => 1
[�] => 1
[and] => 2
[�] => 1
[Also] => 1
[as] => 1
[lower] => 1
[cased] => 1
[�] => 1
[�] => 1
[�] => 1
)


How can I make it "see" the Swedish characters and other special characters?

Answer

Here is a regex-based solution: split the string into "words" on punctuation and whitespace using a Unicode-aware pattern, then do a regular array occurrence count.

array_count_values(preg_split('/[[:punct:]\s]+/u', $string, -1, PREG_SPLIT_NO_EMPTY));

Produces:

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [Å] => 1
    [Ä] => 1
    [and] => 2
    [Ö] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [å] => 1
    [ä] => 1
    [ö] => 1
)

This was tested in a Unicode console; you might want to impose an encoding if you are using a browser. Either add a <meta> charset tag, set the encoding within your browser, or send the appropriate header from PHP.
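Since the question asks for the most used words, the counts still need to be sorted, and a browser needs to be told the encoding. A minimal sketch of both follows, assuming the source file is saved as UTF-8 and the mbstring extension is available. The helper name count_words is hypothetical, and lower-casing before counting is an added assumption so that "Å" and "å" are counted as the same word.

```php
<?php
// Tell a browser that the output is UTF-8; skipped on the command line.
// text/plain keeps print_r's layout readable; a real page would use text/html.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/plain; charset=utf-8');
}

/**
 * Hypothetical helper: count word occurrences in a UTF-8 string,
 * ignoring case, and return them with the highest count first.
 */
function count_words(string $text): array
{
    // Lower-case first so "Å" and "å" become the same key (needs mbstring).
    $text = mb_strtolower($text, 'UTF-8');

    // Same split as in the answer: runs of punctuation and whitespace.
    $words = preg_split('/[[:punct:]\s]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return []; // preg_split fails on invalid UTF-8 input
    }

    $counts = array_count_values($words);
    arsort($counts); // sort by count, descending, keeping the word keys
    return $counts;
}

$string = 'This is just a test post with the Swedish characters Å, Ä, and Ö. Also as lower cased characters: å, ä, and ö.';
$counts = count_words($string);

// Show only the three most frequent words.
print_r(array_slice($counts, 0, 3, true));
```

Several words tie at a count of 2 in this sample, so which three appear first can depend on the PHP version's sort behaviour. Drop the mb_strtolower call if upper- and lower-case forms should be counted separately, as in the answer's output.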