joefromct joefromct - 1 month ago 7
Python Question

list of strings - remove commonalities of common strings

I'm struggling to come up with a way to solve this problem (and how to come up with a name for this question on stackoverflow).

I need to somehow remove commonalities of strings preserving the remainder.

Given a list such as the following:

l = ('first',
'first.second',
'first.second.third',
'a.b.c',
'x.y.z',
'x.y')


I'm hoping to have a list output as:

('first',
'second',
'third',
'a.b.c',
'x.y',
'z' )


as you can see, when
first.second
subtracts
first
and remains
second
.
first.second
is subtracted from
first.second.third
we get
third
.
a.b.c
doesn't have anything to subtract from itself, so it remains.
x.y
is subtracted from
x.y.z
and
z
remains.

I'm thinking maybe a
sort(key=len)
will be part of the solution, as well as some sort of recursion to end up with the string remainder. I'm hoping for a clean way to remove commonalities of each string in the list.

Answer

I've thought about this interesting problem some more and came up with a solution.

The problem is fundamentally tree structured, regardless of which tree-like technique you end up using:

  • an actual tree datatype (which is how I initially solved it, but it was much more verbose)
  • recursion
  • simulating recursion using stacks (which is what I ultimately ended doing).

I'll use the following list of expanded list of words since it points out some edge cases that make other solutions fail:

li = ['First',
      'FirstSecond',
      'FirstSecondFirst',
      'FirstSecondFirstly',
      'FirstSecondThird',
      'FirstFoo',
      'FirstBarracudaSphyraena',
      'xyz',
      'xy',
      '12345',
      '123',
      '1',
      'FireTruckGarage',
      'FireTruck',
      'Fire']

The trick is in noticing that, every time there's a potential lengthening of the prefix, we must save the previous prefix on the prefix stack (called prefixes here), which contains all prefixes seen so far that haven't yet been exhausted. After we've done with some of the "deeper" words—in the sense of being nodes deeper down the tree—we may need to backtrack and use an old prefix for some shorter word yet again.

After encountering a word that is not prefixed by the current prefix, we must pop the stack of prefixes until we reach one that does prefix the word and continue from there.

def ambiguate(words):
    output = {}
    prefixes = ['']
    prefix = prefixes[0]

    for item in sorted(set(words)):
        backtracked = False

        while not item.startswith(prefix):
            prefix = prefixes.pop()
            backtracked = True

        # If we have backtracked, we put back the current prefix onto the
        # prefix stack since we may have to use it later on.
        if backtracked:
            prefixes.append(prefix)

        # Construct new output and a new prefix and append it to the
        # prefix stack
        output[item] = item.replace(prefix, '', 1)
        prefix = item
        prefixes.append(prefix)

    return output

Running:

print(ambiguate(li))

Yields:

{'1': '1',                                       
 '123': '23',                                    
 '12345': '45',                                  
 'Fire': 'Fire',                                 
 'FireTruck': 'Truck',                           
 'FireTruckGarage': 'Garage',                    
 'First': 'First',                               
 'FirstBarracudaSphyraena': 'BarracudaSphyraena',
 'FirstFoo': 'Foo',                              
 'FirstSecond': 'FirstSecond',                   
 'FirstSecondFirst': 'First',                    
 'FirstSecondFirstly': 'ly',                     
 'FirstSecondFourth': 'Fourth',                  
 'FirstSecondThird': 'FirstSecondThird',         
 'a': 'a',                                       
 'abc': 'bc',                                    
 'xy': 'xy',                                     
 'xyz': 'z'}
Comments