zihan meng - 1 month ago
Python Question

How to remove lines from a file that are sub-strings of other lines

Here is a file where many lines are sub-strings of other lines. How could I filter it to include only the longest version of each line?

buffer not
buffer not available
code 000001
error pxa_no_shared_memory
error pxa_no_shared_memory occurred
error pxa_no_shared_memory occurred short
error pxa_no_shared_memory occurred short dump
failed return
failed return code
failed return code 000001
for pxa
for pxa buffer
for pxa buffer not
for pxa buffer not available
initialization runt
initialization runt failed
initialization runt failed return
initialization runt failed return code
initialization runt failed return code 000001
memory for
memory for pxa
memory for pxa buffer
memory for pxa buffer not
memory for pxa buffer not available
not available
occurred short
occurred short dump


If a short phrase occurs inside a longer one, I want to keep only the longest. For example, "buffer not" also occurs in "buffer not available" and in "memory for pxa buffer not available", so only "memory for pxa buffer not available" should be kept.

The output should be a text file containing only the longest error messages, like this:

error pxa_no_shared_memory occurred short dump
initialization runt failed return code 000001
memory for pxa buffer not available

mkj
Answer

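Not sure about efficiency (it compares every line against every remaining candidate, so it is quadratic in the number of lines), but this works: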

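with open('lines.txt') as f:
    original = f.read().splitlines()

# Start with every distinct line, then drop any line contained in another one.
results = set(original)
for o in original:
    # Iterate over a copy so that results can be modified inside the loop.
    for r in set(results):
        if o == r:
            continue
        if o in r:
            # discard() does nothing if o was already removed earlier.
            results.discard(o)
        elif r in o:
            results.discard(r)

# Sets are unordered, so the surviving lines print in arbitrary order.
print('\n'.join(results))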
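
If you want the result in a stable order and written to a text file, as the question asks, something like the following might be a starting point. It is only a sketch of a different approach (longest lines first, so each line is checked only against lines already kept), not a tuned solution. The names `keep_longest` and `filter_file` and the output file name `longest.txt` are made up for this example and are not from the question.

```python
def keep_longest(lines):
    """Return, sorted, the lines that are not sub-strings of any other line."""
    kept = []
    # Visit the longest lines first: a line can only be contained in a
    # line that is at least as long, so checking against `kept` is enough.
    for line in sorted(set(lines), key=len, reverse=True):
        # Skip blank lines and anything contained in an already kept line.
        if line and not any(line in longer for longer in kept):
            kept.append(line)
    return sorted(kept)


def filter_file(src, dst):
    """Read src, keep only the longest lines, and write them to dst."""
    with open(src) as f:
        lines = f.read().splitlines()
    with open(dst, 'w') as f:
        f.write('\n'.join(keep_longest(lines)) + '\n')


if __name__ == '__main__':
    # 'lines.txt' is the input name used above; 'longest.txt' is assumed.
    filter_file('lines.txt', 'longest.txt')
```

This is still quadratic in the worst case, when almost nothing is a sub-string of anything else, but on input like the sample most lines are dropped after a few comparisons.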