Python Question

Most frequent words in a French text

I am using the Python nltk package to find the most frequent words in a French text, but I find it is not really working.
Here is my code:

#-*- coding: utf-8 -*-

#nltk: package for text analysis
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import nltk
import tokenize
import codecs
import unicodedata


#normalize accented characters down to plain ASCII
def convert_accents(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')



### MAIN ###

#openfile
text_temp=codecs.open('text.txt','r','utf-8').readlines()

#put content in a list
text=[]
for word in text_temp:
    word=word.strip().lower()
    if word!="":
        text.append(convert_accents(word))

#tokenize the list
text=nltk.tokenize.word_tokenize(str(text))

#use FreqDist to get the most frequents words
fdist = FreqDist()
for word in text:
    fdist.inc(word)
print "BEFORE removing meaningless words"
print fdist.items()[:10]

#use stopwords to remove articles and other meaningless words
for sw in stopwords.words("french"):
    if fdist.has_key(sw):
        fdist.pop(sw)
print "AFTER removing meaningless words"
print fdist.items()[:10]


Here is the output:

BEFORE removing meaningless words
[(',', 85), ('"', 64), ('de', 59), ('la', 47), ('a', 45), ('et', 40), ('qui', 39), ('que', 33), ('les', 30), ('je', 24)]
AFTER removing meaningless words
[(',', 85), ('"', 64), ('a', 45), ('les', 30), ('parce', 15), ('veut', 14), ('exigence', 12), ('aussi', 11), ('pense', 11), ('france', 10)]


My problem is that stopwords does not discard all the meaningless words.
For example, ',' is not a word and should be removed, and 'les' is an article that should be removed.

How can I fix this?

The text I used can be found at this page:
http://www.elysee.fr/la-presidence/discours-d-investiture-de-nicolas-sarkozy/

Answer

Usually it's a better idea to use a stopword list of your own. For this purpose, you can get a list of French stopwords from here; the article 'les' is on that list too. Create a text file of them and use that file to remove stopwords from your corpus, as in the sketch below.
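A minimal sketch of that filtering step, assuming the list is saved one word per line in a utf-8 file named french_stopwords.txt (a hypothetical name) and reusing the token list text from the question's script:

import codecs

# load your own stopword list, one word per line (hypothetical file name)
stop = set(line.strip().lower()
           for line in codecs.open('french_stopwords.txt', 'r', 'utf-8'))

# filter the tokens before counting instead of popping entries afterwards
text = [w for w in text if w not in stop]

Then for punctuation you have to write a punctuation removal function. How you should write it depends highly on your application. But just to show you a few examples that will get you started, you can write: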

import string
t = "hello, eric! how are you?"
# maketrans("", "") builds an identity table; the second argument lists characters to delete
print t.translate(string.maketrans("", ""), string.punctuation)

and the output is:

hello eric how are you
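Note that the two-argument form of translate above exists only on Python 2 byte strings; on Python 3 the equivalent builds a deletion table with str.maketrans instead:

print(t.translate(str.maketrans('', '', string.punctuation)))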

or, another way is to simply write:

t = t.split()
for w in t:
    # strip leading/trailing punctuation from each token
    w = w.strip('\'"?,.!_+=-')
    print w

So it really depends on how you need them removed. In certain scenarios these methods might not give you exactly what you want, but you can build on them. Let me know if you have any further questions.
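Putting the two ideas together with the question's script, here is a minimal end-to-end sketch of the counting step, still in the Python 2 / NLTK 2.x idiom of the original code and still assuming the hypothetical french_stopwords.txt. It deliberately keeps the accents, since stripping them is what can prevent a token like 'a' (from 'à') from matching the accented entries of a stopword list:

#-*- coding: utf-8 -*-
import codecs
import nltk
from nltk.probability import FreqDist

# your own stopword list, one word per line (hypothetical file name)
stop = set(line.strip().lower()
           for line in codecs.open('french_stopwords.txt', 'r', 'utf-8'))

# read and lowercase the raw text; keep accents so stopwords match
raw = codecs.open('text.txt', 'r', 'utf-8').read().lower()

fdist = FreqDist()
for w in nltk.tokenize.word_tokenize(raw):
    w = w.strip(u'\'"?,.!_+=-')  # crude punctuation stripping
    if w and w not in stop:
        fdist.inc(w)

print fdist.items()[:10]

Filtering before counting also removes the need for the has_key/pop loop at the end of the original script.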
