Paul Johnson - 2 months ago
Python Question

Python - "Undo" text-wrap

I need to take a text and remove the \n characters, which I believe I've done. The next task is to remove the hyphen from words where it should not appear, but to leave it in compound words where it belongs. For example, 'encyclo-\npedia' should become 'encyclopedia' and 'long-\nterm' should become 'long-term'. The suggestion is to compare it with an original text.

# Raw string so the backslashes in the Windows path are not treated as escapes.
with open(r'C:\Users\Paul\Desktop\Comp_Ling_Research_1\BROWN_A1_hypenated.txt') as myfile:
    data = myfile.read().replace('\n', '')


I have a general idea of what to do but NLP is quite new to me.

Answer

A first pass would be to keep a set of valid words around and de-hyphenate if your de-hyphenated word is in the set of valid words. Ubuntu has a list of valid words at /usr/share/dict/american-english. An overly simple version might look like:

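# valid_words_file: e.g. the word list above; new_file: the hyphenated text.
with open(valid_words_file) as f:
    valid_words = set(line.strip() for line in f)

# Rejoin words split across lines; split() handles the remaining newlines.
with open(new_file) as f:
    words = f.read().replace('-\n', '-').split()

output = []
for word in words:
    # Drop the hyphen only when the joined form is a known word.
    if '-' in word and word.replace('-', '') in valid_words:
        output.append(word.replace('-', ''))
    else:
        output.append(word)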

You would have to deal with punctuation, capitalization, etc., but that's the idea.
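As a rough sketch of how the punctuation and capitalization cases might be handled, one option is to use a regular expression that only looks at hyphens sitting directly before a line break, and to compare the joined word in lowercase. The function name `dehyphenate` and the toy word set are made up for illustration, and this assumes your word list is lowercase. Note that this is a slightly different technique from the loop above: it leaves hyphens in the middle of a line alone.

```python
import re

def dehyphenate(text, valid_words):
    """Rejoin words split across line breaks, using a set of known lowercase words."""
    # Match 'left-\nright' where both halves are runs of ASCII letters,
    # so surrounding punctuation is never part of the lookup.
    pattern = re.compile(r"([A-Za-z]+)-\n([A-Za-z]+)")

    def join(match):
        left, right = match.group(1), match.group(2)
        # Compare in lowercase so 'Encyclo-\npedia' is still recognised.
        if (left + right).lower() in valid_words:
            return left + right        # hyphen came from wrapping: drop it
        return left + "-" + right      # assume a genuine compound: keep it

    text = pattern.sub(join, text)
    # Remaining newlines are ordinary line wraps; turn them into spaces.
    return text.replace("\n", " ")


valid_words = {"encyclopedia", "long", "term"}  # toy stand-in for a real word list
sample = "An encyclo-\npedia, read over the long-\nterm.\nNext line."
print(dehyphenate(sample, valid_words))
# An encyclopedia, read over the long-term. Next line.
```

This will still get some words wrong. A compound whose joined form happens to be a dictionary word (for example 're-\nsign' versus 'resign') would lose its hyphen, so comparing against the original text, as suggested in the question, is a useful check.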
