User987 - 1 month ago 8
C# Question

Algorithm for finding the most common three-word sequence repeated across multiple sentences

I'm working out a couple of ideas for an algorithm that finds the 3 consecutive words that occur most often across multiple sentences. What do I mean by that? Take the example below. Let's say I have 3 sentences as follows:

1. New Samsung Galaxy S7 Edge SM-G935FD Duos 12MP 4G (FACTORY UNLOCKED) 32GB Phone
2. Samsung Galaxy S7 32GB G930P (GSM Unlocked) 4G LTE 12MP Smartphone Black A
3. New Samsung Galaxy S7 SM-G930FD Duos 5.1'' 12MP (FACTORY UNLOCKED) 32GB Phone


The algorithm should determine that the 3 most common words (all next to each other) are: "Samsung Galaxy S7".

My idea (I believe this is the simplest one to implement) is to take the first 3 words of the first sentence and start from there. For example:

1st loop: I get this group of 3 words: New Samsung Galaxy
2nd loop: I get this group of 3 words (skipping the first word in the sentence): Samsung Galaxy S7...

The process goes on like this until the first sentence (string) ends.
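
A minimal sketch of the sliding window described above, for illustration only. The class and method names (`TrigramWindow`, `ThreeWordGroups`) are hypothetical, and splitting on single spaces is an assumption; real product titles may need smarter tokenization.

```csharp
using System;
using System.Collections.Generic;

static class TrigramWindow
{
    // Yields every run of three consecutive words in a sentence,
    // moving the window forward by one word on each step.
    public static IEnumerable<string> ThreeWordGroups(string sentence)
    {
        // Assumption: words are separated by spaces only.
        string[] words = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i + 3 <= words.Length; i++)
        {
            // Join words[i], words[i + 1], words[i + 2] back into one string.
            yield return string.Join(" ", words, i, 3);
        }
    }
}
```

For the first sentence this yields "New Samsung Galaxy", then "Samsung Galaxy S7", and so on until fewer than three words remain.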

Now my questions to you guys are:


  1. Is the approach I described above a good way to do this?

  2. Are there algorithms that do the same thing but are more efficient in terms of time (i.e. they run faster)?



Can someone help me out with this? Thanks! :)

Answer

No, there is no faster way, because to find the three most common words in the string array you have to scan every line and check for possible matches.
But there is an improvement: if the three words appear at most once per sentence (there is only one "Samsung Galaxy S7" per sentence) and you want to exit as soon as you have found the first most common group of words, you can add the following check:

// counter = number of sentences that contain the current three-word group
if (counter == array.Length)
    return mostCommonWords;

This works because if the three words are present in every string of the array, no other word group can have a higher counter; at best it can tie. But this check only works if the three words occur at most once per sentence and you want the first most common occurrence.
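
One way the early exit could fit into a full scan is sketched below. This is an assumption-laden illustration, not the answerer's exact method: it keeps the counters in a `Dictionary`, splits on spaces, compares case-insensitively, and counts each group at most once per sentence (via a `HashSet`) so that a counter can never exceed `array.Length`. The names `CommonGroupFinder` and `MostCommonThreeWords` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class CommonGroupFinder
{
    // Returns the three-word group found in the most sentences, or null if
    // no sentence has three words. On a tie, the group that reached the top
    // count first wins.
    public static string MostCommonThreeWords(string[] array)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        string mostCommonWords = null;
        int best = 0;

        foreach (string sentence in array)
        {
            // Assumption: words are separated by spaces only.
            string[] words = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            // Groups already counted for this sentence.
            var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

            for (int i = 0; i + 3 <= words.Length; i++)
            {
                string group = string.Join(" ", words, i, 3);
                if (!seen.Add(group)) continue; // repeated within this sentence

                counts.TryGetValue(group, out int counter); // 0 if not present yet
                counter++;
                counts[group] = counter;

                if (counter > best)
                {
                    best = counter;
                    mostCommonWords = group;
                }

                // Early exit from the answer: the group is in every sentence,
                // so no other group can beat it.
                if (counter == array.Length)
                    return mostCommonWords;
            }
        }
        return mostCommonWords;
    }
}
```

Called with the three example sentences from the question, this returns "Samsung Galaxy S7" while scanning the third sentence, without looking at the rest of it.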
