afropunk - 21 days ago
SQL Question

Look for percentage of characters in a word/phrase within a block of text

I'm looking to create a condition that checks the percentage of characters in every word within a block of text. I'm thinking regexp_matches might be what I need, but I'm not sure how to code it to look for a percentage occurrence for each word/phrase.

In the past I've used the query below to pinpoint spaces within the entire block of text and then calculate a percentage for that over the length of the block:

(select count(*) from regexp_matches(table.blocktext, ' ', 'gi'))

For instance, each block of text could be, say, 100 to 1000 characters (a paragraph or two). Within those characters, for every word I want to check whether any single character constitutes 80% or more of that word/phrase (it could be a letter, number or anything else). So I'm not specifying the character, just asking whether any one character makes up 80% or more of the word/phrase. I also think I'd have to add a length condition to avoid shorter common words where this might happen naturally (like 'woohoo' etc.). So maybe a length condition of 8 or more as well.

I'm hoping this is possible within Postgres, although I haven't managed to find any examples in my searches so far. Any assistance would be much appreciated and many thanks for taking a look.

Answer Source

One way to do this:

SELECT c, count(*) AS ct, (count(*) * 100) / length($1) AS pct
FROM   unnest(string_to_array($1, NULL)) c
WHERE  length($1) > 7                      -- maybe a length condition of 8 or more
GROUP  BY c
HAVING count(*) > (length($1) * 80) / 100  -- 80 is your % threshold

$1 being the string to be analyzed.

Returns characters that make up more than 80 percent of the given text. For thresholds of 50 percent and above this can obviously only ever be a single row. Or nothing, if no character is used frequently enough.

(count(*) * 100) / length($1) is just the fastest way to calculate this, rounding down with integer division. If you want precise results, use count(*) >= (length($1) * 80) / 100.0 instead (also note >= here). The 100.0 coerces the result of the calculation to numeric, so the division is exact.
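To make this concrete, here is the same query with a literal string in place of $1 (a quick sketch; 'aaaaaaaaab' is just a made-up example word: length 10, nine a's):

```sql
SELECT c, count(*) AS ct, (count(*) * 100) / length('aaaaaaaaab') AS pct
FROM   unnest(string_to_array('aaaaaaaaab', NULL)) c
WHERE  length('aaaaaaaaab') > 7
GROUP  BY c
HAVING count(*) > (length('aaaaaaaaab') * 80) / 100;
-- c = 'a', ct = 9, pct = 90  ('b' appears once and is filtered by HAVING)
```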

You can easily wrap this into a function or prepared statement and pass string and percentage as parameters. Like so:

CREATE OR REPLACE FUNCTION f_char_pct(_word text, _pct int)
  RETURNS boolean AS
$func$
SELECT EXISTS (
   SELECT 1
   FROM   unnest(string_to_array(_word, NULL)) c
   GROUP  BY c
   HAVING count(*) > (length(_word) * _pct) / 100
   );
$func$  LANGUAGE sql IMMUTABLE;
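Once the function is created, a quick sanity check (with made-up example words):

```sql
SELECT f_char_pct('aaaaaaaaab', 80);  -- true:  'a' is 9 of 10 characters
SELECT f_char_pct('abcdefghij', 80);  -- false: every character appears once
```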


"If there is a single word in the given document with more than 7 characters and a character percentage below the given threshold, return false. Else return true."

SELECT NOT EXISTS (
   SELECT 1
   FROM   unnest(string_to_array('1000000000000z abc 1234567890', ' ')) word
   WHERE  length(word) > 7
   AND    NOT f_char_pct(word, 80)
   );

Returns false.
Returns true for '1000000000000z abc 2222222' because 'abc' is ignored and the other two words have > 80 % of the same char.
f_char_pct() itself returns false for NULL input, btw.


You can also wrap the whole thing in a single function ...
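For example, such a wrapper might look like this (a sketch only; the name f_blocktext_ok and the default parameters are my invention, and it assumes f_char_pct from above exists):

```sql
CREATE OR REPLACE FUNCTION f_blocktext_ok(_txt text, _min_len int DEFAULT 8, _pct int DEFAULT 80)
  RETURNS boolean AS
$func$
SELECT NOT EXISTS (
   SELECT 1
   FROM   unnest(string_to_array(_txt, ' ')) word  -- split the block into words
   WHERE  length(word) >= _min_len                 -- ignore short words
   AND    NOT f_char_pct(word, _pct)               -- flag words below the threshold
   );
$func$  LANGUAGE sql IMMUTABLE;

SELECT f_blocktext_ok('1000000000000z abc 1234567890');  -- false
```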

unnest(string_to_array($1, ' ')) splits words at every space. More sophisticated definitions are possible; you might use the text search infrastructure.
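As alternatives (a sketch, not from the original answer): regexp_split_to_table() can split on runs of any whitespace, and the text search parser can pick out word tokens directly:

```sql
-- Split on any whitespace (spaces, tabs, newlines), not just single spaces:
SELECT word
FROM   regexp_split_to_table('foo  bar' || chr(10) || 'baz', '\s+') word;

-- Or let the default text search parser extract word tokens,
-- skipping punctuation and blanks:
SELECT token
FROM   ts_parse('default', 'foo, bar... baz!')
WHERE  tokid = 1;  -- 1 = asciiword in the default parser
```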

BTW, I would replace your simple count:

select count(*) from regexp_matches(table.blocktext, ' ', 'gi')

... with this equivalent, but much faster one:

SELECT length(table.blocktext) - length(replace(table.blocktext, ' ', ''))
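To see that the two are equivalent (a quick sketch with a made-up string in place of the table):

```sql
SELECT (SELECT count(*)
        FROM   regexp_matches(t.blocktext, ' ', 'g')) AS ct_regexp
     , length(t.blocktext)
       - length(replace(t.blocktext, ' ', ''))        AS ct_replace
FROM  (SELECT text 'one two  three' AS blocktext) t;
-- ct_regexp = 3, ct_replace = 3
```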

Regular expression functions are powerful, but they come at a price. And the aggregation step makes it much more expensive yet.