I have a database with almost 100,000 comments and I would like to detect the most used words (using stop words to avoid common words).
I want to do this only once, and then use a few of the most popular words to tag the comments that contain them.
Can you help me with the query and PHP code to do this?
The easiest approach I guess would be:
keywords(id, word) and
keywords_comments(keyword_id, comment_id, count)
`keywords` saves a unique id and the keyword you found in a text.
`keywords_comments` stores one row for each connection between a comment and a keyword it contains. In `count` you will save the number of times this keyword occurred in the comment. The two columns `keyword_id` + `comment_id` together form a unique key or directly the primary key.
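In MySQL the two tables could look roughly like this (a sketch; the column types and sizes are assumptions, adjust them to your data):

```sql
CREATE TABLE keywords (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    word VARCHAR(64)  NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_word (word)   -- each keyword is stored only once
);

CREATE TABLE keywords_comments (
    keyword_id INT UNSIGNED NOT NULL,
    comment_id INT UNSIGNED NOT NULL,
    count      INT UNSIGNED NOT NULL,
    PRIMARY KEY (keyword_id, comment_id)  -- one row per keyword/comment pair
);
```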
You have the following two comments:
Hello, how are you?!
Wow, hello. My name is Stefan.
Now you would iterate over both of them and split them by non-word characters. This would result in the following lowercase words for each text:

- First text: hello, how, are, you
- Second text: wow, hello, my, name, is, stefan
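The splitting step can be sketched like this (Python here for brevity; the same regex works with PHP's `preg_split`, and the stop word list is a placeholder you would replace with a real one):

```python
import re

# Assumed stop word list for the example; use a real list in practice.
STOP_WORDS = {"how", "are", "you", "my", "is"}

def tokenize(text):
    """Lowercase the text, split on non-word characters, drop stop words."""
    words = re.split(r"\W+", text.lower())
    return [w for w in words if w and w not in STOP_WORDS]

print(tokenize("Hello, how are you?!"))           # only "hello" survives the stop words
print(tokenize("Wow, hello. My name is Stefan."))
```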
As soon as you have parsed one of these texts, you can insert it into the database right away. I guess you do not want to load 100,000 comments into RAM.
So it would go like this:

- Insert the word into `keywords` if it is not there yet.
- Link the comment with the keyword (insert into `keywords_comments`) and set the count correctly (in our example each word occurs only once in each text; in general you have to count that).
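These two steps can be sketched like this (a sketch using SQLite so it is self-contained; with MySQL you would use `INSERT IGNORE` or `INSERT ... ON DUPLICATE KEY UPDATE` instead of `INSERT OR IGNORE`, and PDO from PHP):

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keywords (
        id   INTEGER PRIMARY KEY,
        word TEXT NOT NULL UNIQUE
    );
    CREATE TABLE keywords_comments (
        keyword_id INTEGER NOT NULL,
        comment_id INTEGER NOT NULL,
        count      INTEGER NOT NULL,
        PRIMARY KEY (keyword_id, comment_id)
    );
""")

def store_keywords(comment_id, words):
    """Insert each distinct word and link it to the comment with its count."""
    for word, count in Counter(words).items():
        # Insert the word into keywords if it is not there yet.
        conn.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
        keyword_id = conn.execute(
            "SELECT id FROM keywords WHERE word = ?", (word,)
        ).fetchone()[0]
        # Link the comment with the keyword and store how often it occurred.
        conn.execute(
            "INSERT INTO keywords_comments (keyword_id, comment_id, count) "
            "VALUES (?, ?, ?)",
            (keyword_id, comment_id, count),
        )
    conn.commit()

# The two example comments from above:
store_keywords(1, ["hello", "how", "are", "you"])
store_keywords(2, ["wow", "hello", "my", "name", "is", "stefan"])
```

Note that "hello" is inserted into `keywords` only once, but linked to both comments.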
A very easy improvement you will probably need for 100,000 comments is to use a counting variable or to add a new field `has_been_analyzed` to each comment. Then you can read them comment by comment from the database.
I usually use counting variables when I read data chunkwise and know that the data cannot change from the direction I am starting from (i.e. it will stay consistent up to the point I have currently reached). Then I do something like:
SELECT * FROM table ORDER BY created ASC LIMIT 0, 100
SELECT * FROM table ORDER BY created ASC LIMIT 100, 100
SELECT * FROM table ORDER BY created ASC LIMIT 200, 100
…
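The chunked reading loop can be sketched like this (again SQLite so it is self-contained; the `comments` table with a `created` column is an assumption standing in for your schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, created INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO comments (created, body) VALUES (?, ?)",
    [(i, f"comment {i}") for i in range(250)],
)

def read_in_chunks(chunk_size=100):
    """Yield comments chunk by chunk, oldest first, using an increasing offset."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT * FROM comments ORDER BY created ASC LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows
        # The counting variable: only safe if already-read rows never move.
        offset += chunk_size

total = sum(len(chunk) for chunk in read_in_chunks())  # 250 rows: 100 + 100 + 50
```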
Consider that this only works if we know for sure that no rows will be inserted at a position we have already read past. E.g. using `ORDER BY created DESC` would not work, as new data could be inserted at the start. Then the whole offset would break and we would read one comment twice and never read the new one.
If you cannot make sure that the outside counting variable stays consistent, you can add a new field `analyzed` which you set to true as soon as you have read the comment. Then you can always see which comments have already been read and which have not. An SQL query would then look like this:
SELECT * FROM table WHERE analyzed = 0 LIMIT 100 /* Reading chunks of 100 */
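With the `analyzed` flag the loop can be sketched like this (SQLite again for a self-contained example; the placeholder line in the loop body is where the parsing and keyword inserts from above would go):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id INTEGER PRIMARY KEY, body TEXT, analyzed INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO comments (body) VALUES (?)",
    [(f"comment {i}",) for i in range(250)],
)

processed = 0
while True:
    # Read the next chunk of comments that have not been analyzed yet.
    rows = conn.execute(
        "SELECT id, body FROM comments WHERE analyzed = 0 LIMIT 100"
    ).fetchall()
    if not rows:
        break
    for comment_id, body in rows:
        processed += 1  # placeholder: parse the comment and store its keywords here
        # Mark the comment as done so it is never read again.
        conn.execute("UPDATE comments SET analyzed = 1 WHERE id = ?", (comment_id,))
    conn.commit()
```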
This works as long as you do not parallelize the workload (with multiple clients or threads). Otherwise you would have to make sure that reading + setting to true is atomic (synchronized).