RohitR - 14 days ago
Python Question

Python Word Frequencies with pre-defined words

I have a set of data in a text file and I would like to build a frequency table based on pre-defined words (drive, street, i, lives). Below is an example:

ID | Text
---|--------------------------------------------------------------------
1 | i drive to work everyday in the morning and i drive back in the evening on main street
2 | i drive back in a car and then drive to the gym on 5th street
3 | Joe lives in Newyork on NY street
4 | Tod lives in Jersey city on NJ street


Here is what I would like to get as output:

ID | drive | street | i | lives
----|--------|----------|------|-------
1 | 2 | 1 | 2 | 0
2 | 2 | 1 | 1 | 0
3 | 0 | 1 | 0 | 1
4 | 0 | 1 | 0 | 1


Here is the code I'm using. It counts words overall, but that doesn't solve my need: I would like to count only a set of pre-defined words, as shown above.

from nltk.corpus import stopwords
import string
from collections import Counter
import nltk
from nltk.tag import pos_tag

xy = open(r'C:\Python\data\file.txt').read().split()
xyz = [w.lower() for w in xy]

stopset = set(stopwords.words('english'))

# drop common stop words, then count what remains
filtered_words = [word for word in xyz if word not in stopset]

print(Counter(filtered_words))
print(len(filtered_words))

Answer

Something like sklearn.feature_extraction.text.CountVectorizer seems to be close to what you're looking for. collections.Counter might also be helpful. How are you planning to use this data structure? If you happen to be doing machine learning/prediction, it's worthwhile to look into the different vectorizers in sklearn.feature_extraction.text.
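
For instance, here's a rough Counter-based sketch (plain Python, no sklearn; the texts and vocab lists below just restate the example from your question):

from collections import Counter

texts = ['i drive to work everyday in the morning and i drive back in the evening on main street',
         'i drive back in a car and then drive to the gym on 5th street',
         'Joe lives in Newyork on NY street',
         'Tod lives in Jersey city on NJ street']

vocab = ['drive', 'street', 'i', 'lives']

rows = []
for t in texts:
    counts = Counter(t.lower().split())      # count every token in this line
    rows.append([counts[w] for w in vocab])  # keep only the pre-defined words

print(vocab)
for row_id, row in enumerate(rows, start=1):
    print(row_id, row)

This prints one row per text line, with counts only for the pre-defined words.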

Edit:

text = ['i drive to work everyday in the morning and i drive back in the evening on main street',
        'i drive back in a car and then drive to the gym on 5th street',
        'Joe lives in Newyork on NY street',
        'Tod lives in Jersey city on NJ street']

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vocab = ['drive', 'street', 'i', 'lives']

vectorizer = CountVectorizer(vocabulary=vocab)

# turn the text above into a matrix of shape R x C,
# where R is the number of rows (elements in your text array)
# and C is the number of words in the vocabulary you supplied
X = vectorizer.fit_transform(text)

# sparse to dense matrix
X = X.toarray()

# get the feature names from the already-fitted vectorizer
vectorizer_feature_names = vectorizer.get_feature_names()

# prove that the vectorizer's feature names are identical to the vocab you specified above
assert vectorizer_feature_names == vocab

# make a table with word frequencies as values and vocab as columns
out_df = pd.DataFrame(data = X, columns = vectorizer_feature_names)

print(out_df)

And, your result:

       drive  street  i  lives
    0      2       1  0      0
    1      2       1  0      0
    2      0       1  0      1
    3      0       1  0      1
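
One caveat: the i column comes out as 0 because CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", drops single-character tokens. If you need "i" counted, as in your expected output, a small variation of the same code (reusing the text and vocab lists above) should do it:

# relax the token pattern so one-character words such as "i" are kept
vectorizer = CountVectorizer(vocabulary=vocab, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(text).toarray()

out_df = pd.DataFrame(data=X, columns=vocab)
print(out_df)

which gives:

       drive  street  i  lives
    0      2       1  2      0
    1      2       1  1      0
    2      0       1  0      1
    3      0       1  0      1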