I'm looking to create a condition that checks the percentage of repeated characters in every word within a block of text. I'm thinking regexp_matches might be what I need, but I'm not sure how to code it to check the percentage of occurrences for each word/phrase.
In the past I've used the query below to pinpoint spaces within the entire block of text and then calculate a percentage of spaces over the length of the block:
(SELECT count(*) FROM regexp_matches(table.blocktext, ' ', 'gi'))
One way to do this:
SELECT c
     , count(*) AS ct
     , (count(*) * 100) / length($1) AS pct
FROM   unnest(string_to_array($1, NULL)) c
WHERE  length($1) > 7  -- maybe a length condition of 8 or more
GROUP  BY 1
HAVING count(*) > (length($1) * 80) / 100  -- 80 is your % threshold
ORDER  BY 2 DESC;
$1 being the string to be analyzed.
Returns characters that make up more than 80 percent of the given text. For thresholds of 50 percent and above there can only be a single such row, obviously. Or no row at all if no character occurs frequently enough.
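To see it in action, substitute a literal string for $1 ('aaaaaaab' here is just my example value):

SELECT c
     , count(*) AS ct
     , (count(*) * 100) / length('aaaaaaab') AS pct
FROM   unnest(string_to_array('aaaaaaab', NULL)) c
GROUP  BY 1
HAVING count(*) > (length('aaaaaaab') * 80) / 100
ORDER  BY 2 DESC;

'a' occurs 7 times in 8 characters, so this should return a single row (a, 7, 87): 7 exceeds the truncated threshold of (8 * 80) / 100 = 6.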
(count(*) * 100) / length($1) is just the fastest way to calculate the percentage, using integer division (which truncates). If you want precise results, use count(*) >= (length($1) * 80) / 100.0 instead (also note the >= here). The 100.0 coerces the result of the calculation to numeric, so nothing is truncated.
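To illustrate the difference between the two variants (the numbers are my own example, for a string of length 8):

SELECT (8 * 80) / 100   AS int_div   -- 6   (640 / 100, truncated)
     , (8 * 80) / 100.0 AS num_div;  -- 6.4 (numeric, exact)

At boundary values the truncated threshold can let a character pass that the exact one would not, which is why the precise variant pairs 100.0 with >=.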
You can easily wrap this into a function or prepared statement and pass string and percentage as parameters. Like so:
CREATE OR REPLACE FUNCTION f_char_pct(_word text, _pct int)
  RETURNS boolean AS
$func$
SELECT EXISTS (
   SELECT 1
   FROM   unnest(string_to_array(_word, NULL)) c
   GROUP  BY c
   HAVING count(*) > (length(_word) * _pct) / 100
   )
$func$  LANGUAGE sql IMMUTABLE;
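A call might then look like this (the test word is my own example):

SELECT f_char_pct('1000000000000z', 80);
-- '0' makes up 12 of 14 characters, 12 > (14 * 80) / 100 = 11, so this returns true
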
"If there is a single word in the given document with more than 7 characters and a character percentage below the given threshold, return FALSE."
SELECT NOT EXISTS (
   SELECT 1
   FROM   unnest(string_to_array('1000000000000z abc 2222222', ' ')) word
   WHERE  length(word) > 7
   AND    NOT f_char_pct(word, 80)
   );
Returns TRUE for '1000000000000z abc 2222222': 'abc' and '2222222' are ignored (7 characters or fewer), and '1000000000000z' consists of the same character for more than 80 % of its length.
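Conversely, a document containing a long word without a dominant character should return FALSE. '1234567890 abc' is my own example here:

SELECT NOT EXISTS (
   SELECT 1
   FROM   unnest(string_to_array('1234567890 abc', ' ')) word
   WHERE  length(word) > 7
   AND    NOT f_char_pct(word, 80)
   );
-- '1234567890' has 10 distinct characters, so no character reaches the 80 % threshold: false
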
f_char_pct() returns FALSE for NULL input, BTW.
You can also wrap the whole thing in a single function ...
unnest(string_to_array($1, ' ')) splits words at every space. More sophisticated definitions of a "word" are possible; you might use regexp_split_to_table() or the text search infrastructure.
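For instance, splitting on any run of non-word characters instead of plain spaces might look like this (a sketch with my own sample string, not part of the original query):

SELECT word
FROM   regexp_split_to_table('one, two; three-four', '\W+') word
WHERE  word <> '';  -- discard empty fragments from leading/trailing separators

This treats punctuation and hyphens as word boundaries, so 'three-four' becomes two words; whether that is desirable depends on your definition of a word.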
BTW, I would replace your simple count:
select count(*) from regexp_matches(table.blocktext, ' ', 'gi')
.. with this equivalent, but much faster one:
SELECT length(table.blocktext) - length(replace(table.blocktext, ' ', ''))
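Both count the spaces; for example, with a sample string of my own:

SELECT length('a b  c') - length(replace('a b  c', ' ', '')) AS spaces;  -- 3

replace() is a plain string operation, so this avoids both the regex engine and the aggregation step.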
Regular expression functions are powerful, but they come at a price. And the aggregation step makes it much more expensive yet.