Bonson Bonson - 5 months ago
Python Question

Get the frequency for each of the ngram terms using sklearn

I am extracting the ngrams from a pandas dataframe using the following method:

import sklearn.feature_extraction.text

def extractNGrams(df, ngram_size, min_freq):
    """Extract NGrams from the sentences in a pandas dataframe
    Keyword arguments:
    df -- the pandas dataframe containing the sentences
    ngram_size -- defining the n for ngrams
    min_freq -- the minimum number of sentences an ngram must appear in (passed as min_df)
    """
    vect = sklearn.feature_extraction.text.CountVectorizer(
        ngram_range=(ngram_size, ngram_size), min_df=min_freq)
    lstSentences = df['Text'].values.tolist()
    X_train_counts = vect.fit_transform(lstSentences)
    # On scikit-learn older than 1.0, use vect.get_feature_names() instead
    vocab = vect.get_feature_names_out()
    #print (vocab)
    print(X_train_counts.shape)
    return vocab


How can I get the frequency of each of the ngram terms?

Answer

Posting the code I used to get the counts. It goes inside the function, after the call to fit_transform:

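import numpy as np

# vect and X_train_counts are the fitted vectorizer and count matrix from the question
train_data_features = X_train_counts.toarray()
# On scikit-learn older than 1.0, use vect.get_feature_names() instead
vocab = vect.get_feature_names_out()
# Sum each column to get the total count of every ngram across all sentences
dist = np.sum(train_data_features, axis=0)
ngram_freq = {}

# For each ngram, store its total frequency in the dictionary
for tag, count in zip(vocab, dist):
    #print(tag, count)
    ngram_freq[tag] = count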
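
For a large corpus, calling toarray() can use a lot of memory, because it expands the sparse count matrix into a dense one. A minimal sketch of the same idea that sums the sparse matrix directly is below. The function name ngram_frequencies and the sample sentences are my own, not from the question, and the hasattr check is only there as an assumption about which scikit-learn version is installed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def ngram_frequencies(sentences, ngram_size=2, min_freq=1):
    """Return a dict mapping each ngram to its total count over all sentences.

    Hypothetical helper for illustration. min_freq is passed to min_df, so it
    is a minimum number of sentences containing the ngram, not a total count.
    """
    vect = CountVectorizer(ngram_range=(ngram_size, ngram_size), min_df=min_freq)
    counts = vect.fit_transform(sentences)  # sparse matrix: sentences x ngrams

    # Column sums without densifying; the result is flattened to a 1-D array
    totals = np.asarray(counts.sum(axis=0)).ravel()

    # get_feature_names_out() exists from scikit-learn 1.0; fall back otherwise
    if hasattr(vect, "get_feature_names_out"):
        names = vect.get_feature_names_out()
    else:
        names = vect.get_feature_names()

    return {name: int(total) for name, total in zip(names, totals)}


if __name__ == "__main__":
    sample = ["the cat sat", "the cat ran", "a dog sat"]
    freq = ngram_frequencies(sample, ngram_size=2, min_freq=1)
    print(freq["the cat"])  # 2
```

Note that the default tokenizer drops single-character tokens, so "a" in the last sentence does not take part in any bigram. With a dataframe as in the question, the input would be df['Text'].values.tolist().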