darris ford darris ford - 4 months ago 11
SQL Question

How to count words in specific column against matching words in another table

I want to be able to:


  • extract specific words from column1 in Table1 - but only the words that are matched from Table2 from a column called word,

  • perform a(n individual) count of the number of words that have been found, and

  • put this information into a permanent table with a format, that looks like:



Final

Word | Count
--------+------
Test | 7
Blue | 5
Have | 2


Currently I have tried this:

INSERT INTO final (word, count)
SELECT
extext
, SUM(dbo.WordRepeatedNumTimes(extext, 'test')) AS Count
FROM [dbo].[TestSite_Info], [dbo].[table_words]
WHERE [dbo].[TestSite_Info].ExText = [dbo].[table_words].Words
GROUP BY ExText;


The function dbo.WordRepeatedNumTimes is:

ALTER function [dbo].[WordRepeatedNumTimes]
(@SourceString varchar(8000),@TargetWord varchar(8000))
RETURNS int
AS
BEGIN
DECLARE @NumTimesRepeated int
,@CurrentStringPosition int
,@LengthOfString int
,@PatternStartsAtPosition int
,@LengthOfTargetWord int
,@NewSourceString varchar(8000)

SET @LengthOfTargetWord = len(@TargetWord)
SET @LengthOfString = len(@SourceString)
SET @NumTimesRepeated = 0
SET @CurrentStringPosition = 0
SET @PatternStartsAtPosition = 0
SET @NewSourceString = @SourceString

WHILE len(@NewSourceString) >= @LengthOfTargetWord
BEGIN
SET @PatternStartsAtPosition = CHARINDEX (@TargetWord,@NewSourceString)
IF @PatternStartsAtPosition <> 0
BEGIN
SET @NumTimesRepeated = @NumTimesRepeated + 1
SET @CurrentStringPosition = @CurrentStringPosition + @PatternStartsAtPosition +
@LengthOfTargetWord
SET @NewSourceString = substring(@NewSourceString, @PatternStartsAtPosition +
@LengthOfTargetWord, @LengthOfString)
END
ELSE
BEGIN
SET @NewSourceString = ''
END
END
RETURN @NumTimesRepeated
END





When I run the above
INSERT
statement, no record is inserted.

In the table TestSite_Info is a column called Extext. Within this column, there is random text - one of the words being 'test'.

In the other table called Table_Words, I have a column called Words and one of the words in there is 'Test'. So in theory, as the word is a match, I would pick it up, put it into the table Final, and then next to the word (in another column) the count of how many times the word has been found within TestSite_Info.Extext.

Table_Words

id|word
--+----
1 |Test
2 |Onsite
3 |Here
4 |As


TestSite_Info

ExText
-------------------------------------------------
This is a test, onsite test , test test i am here


The expected Final table has been given at the top.

-- Update

Now that i have run Abecee block of code this actually works in terms of bringing back a count column and the id relating to the word.

Here are the results :

id|total
--+----
169 |3
170 |0
171 |5
172 |7
173 |1
174 |3


Taken from the following text which it is extracting from :

Test test and i went to and this was a test im writing rubbish hello
but i dont care about care and care seems care to be the word that you will see appear
four times as well as word word word word word, but a .!
who knows whats going on here.

So as you can see, the count for ID 172 appears 7 times (as a reference please see below to what ID numbers relate to in terms of words) which is incorrect it should appear appear 6 times (its added +1 for some reason) as well as ID 171 which is the word care, that appears 4 times but is showing up as 5 times on the count. Any ideas why this would be?
Also what i was really after was a way as you have quite kindly done of the table showing the ID and count BUT also showing the word it relates to as well in the final table, so i don't have to link back through the ID table to see what the actual word is.
Many Thanks for your help on this.

Word|id
--+----
as |174
here |173
word |172
care |171
hello |170
test |169

Answer

You could work along the updated

WITH
Detail AS (
  SELECT
    W.id
    , W.word
    , T.extext
    , (LEN(REPLACE(T.extext, ' ', '  ')) + 2
      - LEN(REPLACE(' '
                    + UPPER(REPLACE(REPLACE(REPLACE(REPLACE(T.extext, ' ', '  '), ':', ' '), '.', ' '), ',', ' '))
                    + ' ', ' ' + UPPER(W.word) + ' ', '')) - 1
      ) / (LEN(W.word) + 2) count
  FROM Table_Words W
  JOIN TestSite_Info T
    ON CHARINDEX(UPPER(W.word), UPPER(T.extext)) > 0
)
INSERT INTO Result
SELECT
  id
  , SUM(count) total
FROM Detail
GROUP BY id
;

(Had forgotten to count in the blanks added to the front and the end, missed a sign change, and got mixed up as for the length of the word(s) surrounded by blanks. Sorry about that. Thanks for testing it more thoroughly than I did originally!)
Tested on SQL Server 2008: Updated SQL Fiddle and 2012: Updated SQL Fiddle.
And with your test case as well.

It:

  • is pure SQL (no UDF required),
  • has room for some tuning:
    • Store words all lower / all upper case, unless case matters (Which would require to adjust the suggested solution.)
    • Store strings to check with all punctuation marks removed.

Please comment if and as further detail is required.