I'm trying to figure out how to go about determining the most used words on a mysql dataset.
Not sure how to go about this or if there's a simpler approach. Read a couple posts where some suggests an algorithm.
From 24,500 records, find out the top 10 used words.
Right, this runs like a dog and is limited to working with a single delimiter, but hopefully will give you an idea.
SELECT aWord, COUNT(*) AS WordOccuranceCount FROM (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(concat(SomeColumn, ' '), ' ', aCnt), ' ', -1) AS aWord FROM SomeTable CROSS JOIN ( SELECT a.i+b.i*10+c.i*100 + 1 AS aCnt FROM integers a, integers b, integers c) Sub1 WHERE (LENGTH(SomeColumn) + 1 - LENGTH(REPLACE(SomeColumn, ' ', ''))) >= aCnt) Sub2 WHERE Sub2.aWord != '' GROUP BY aWord ORDER BY WordOccuranceCount DESC LIMIT 10
This relies on having a table called integers with a single column called i with 10 rows with the values 0 to 9. It copes with up to ~1000 words but can easily be altered to cope with more (but will slow down even more).