Ailton Ailton - 3 months ago 8
Python Question

Python: How to remove duplicate words in a string that are not next to each other?

In the example below, I need to remove only the third "animale" which is alone in the string. How can I do that?

a = 'animale animale eau toilette animale'


Second "animale": don't remove

Third "animale": remove

Answer

If I understand your question correctly, you want to remove any occurrences of a word that are duplicates but not adjacent to another copy of it. I think this solution works for that:

from collections import defaultdict

def remove_duplicates(s):
    result = []
    word_counts = defaultdict(int)
    words = s.split()
    # count the frequency of each word
    for word in words:
        word_counts[word] += 1
    # loop through all words, and only add a word to result if either
    # it occurs only once, or it occurs more than once and the next
    # word is the same as the current word
    for i in range(len(words)-1):
        curr_word = words[i]
        if word_counts[curr_word] > 1:
            if words[i+1] == curr_word:
                result.append(curr_word)
                result.append(curr_word)
                word_counts[curr_word] = -1    # mark as -1 so as not to add again
                # no manual skip is needed (reassigning i inside a for loop has no effect):
                # the next word now has count -1, so the next iteration ignores it
            # if there are only two occurrences of the word left but they aren't adjacent, add one and mark the counts so you don't add it again.
            elif word_counts[curr_word] < 3:
                result.append(curr_word)
                word_counts[curr_word] = -1    # mark as -1 so as not to add again
            # not adjacent but more than 2 occurrences left so decrement number of occurrences left
            else:
                word_counts[curr_word] -= 1 
        elif word_counts[curr_word] == 1:
            result.append(curr_word)
            word_counts[curr_word] = -1
    # The loop above stops before the last word, so check it separately
    # (the "words and" guard avoids an IndexError on an empty string)
    if words and word_counts[words[-1]] == 1:
        result.append(words[-1])
    return ' '.join(result)

I think this works for any case where the repeated words aren't adjacent including @Dartmouth's example of 'animale animale eau toilette animale eau eau'.

Sample inputs and outputs:

Inputs                                                            Outputs
=====================================================             =========================================
'animale animale eau toilette animale'                  ---->     'animale animale eau toilette'
'animale animale eau toilette animale eau eau'          ---->     'animale animale toilette eau eau'
'animale eau toilette animale eau eau'                  ---->     'animale toilette eau eau' 
'animale eau toilette animale eau de eau de toilette'   ---->     'animale toilette eau de'
'animale animale eau toilette animale eau eau compte'   ---->     'animale animale toilette eau eau compte'
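
If a simpler rule is enough for you, here is a rough sketch of a different approach (not the same rule as the function above): keep the first run of each word, however long it is, and drop every later run of that word. The name keep_first_runs is just something I made up for illustration. It relies on itertools.groupby, which groups adjacent equal items.

```python
from itertools import groupby

def keep_first_runs(s):
    """Keep the first run of each word; drop every later run of that word."""
    seen = set()
    result = []
    # groupby yields (word, run) pairs, one per stretch of adjacent equal words
    for word, run in groupby(s.split()):
        if word not in seen:
            seen.add(word)
            result.extend(run)   # keep the whole first run, e.g. both "animale"
    return ' '.join(result)

print(keep_first_runs('animale animale eau toilette animale'))
# animale animale eau toilette
```

Note that this gives different results from the function above on some inputs. For 'animale animale eau toilette animale eau eau' it returns 'animale animale eau toilette', because the first "eau" counts as the first run and the later "eau eau" pair is dropped. Use it only if that is the behaviour you want.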