Python Question

How to remove gibberish that exhibits no pattern using Python NLTK?

I am writing code to clean URLs and extract just the underlying text.

import re
from nltk.corpus import stopwords

train_str = train_df.to_string()                    # flatten the DataFrame to one string
letters_only = re.sub("[^a-zA-Z]", " ", train_str)  # keep letters only
words = letters_only.lower().split()                # lowercase and split on whitespace
stops = set(stopwords.words("english"))
stops.update(['url', 'https', 'http', 'com'])       # treat common URL fragments as stopwords
meaningful_words = [w for w in words if w not in stops]
long_words = [w for w in meaningful_words if len(w) > 3]


Using the above code, I am able to extract just the words after removing punctuation, stopwords, etc. However, I am unable to remove words that are gibberish in nature. These are some of the many words I get after cleaning the URLs:

['uact', 'ahukewim', 'asvpoahuhxbqkhdtibveqfggtmam', 'fchrisalbon','afqjcnhil', 'ukai', 'khnaantjejdfrhpeza']


There is no particular pattern in their occurrence or in their letters that I could exploit with a regex or other functions. Could anyone suggest a way to remove these words?
Thanks!

Answer

Create an empty list. Loop through all the words in the current list and use words.words() from the NLTK corpus to check whether each one is a real word. Append all the "non-junk" words to that new list, then use the new list for whatever you'd like.

from nltk.corpus import words

test = ['uact', 'ahukewim', 'asvpoahuhxbqkhdtibveqfggtmam', 'fchrisalbon',
        'afqjcnhil', 'ukai', 'khnaantjejdfrhpeza', 'this', 'is', 'a', 'word']
final = []

for x in test:
    if x in words.words():   # keep only tokens found in the English word list
        final.append(x)
print(final)

output:

['this', 'is', 'a', 'word']
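
Note that words.words() returns a large list, so checking membership against it inside a loop is slow for many tokens. Below is a minimal sketch of how the same filter could be applied to the long_words list from the question, building the vocabulary as a set once for speed; english_vocab and real_words are just illustrative names.

from nltk.corpus import words

# Build the English vocabulary once as a set for fast membership tests.
english_vocab = set(words.words())

# long_words is the list produced by the question's cleaning code.
real_words = [w for w in long_words if w in english_vocab]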