Rashmi Singh - 2 months ago
Python Question

Scikit Learn - Extract word tokens from a string delimiter using CountVectorizer

I have a list of strings. If a string contains the '#' character, I want to extract only the part of the string before it and get the frequency count of the word tokens in that part, i.e.
if the string is "first question # on stackoverflow"
the expected tokens are "first", "question"

If the string does not contain '#' then return tokens of the whole string.

To compute the term-document matrix I am using CountVectorizer from scikit-learn.

My code is below:

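    from sklearn.feature_extraction.text import CountVectorizer

    class MyTokenizer(object):
        def __call__(self, s):
            # Return the whole string if there is no '#', else the part before it
            if '#' not in s:
                return s
            return s.split('#')[0]

    def FindKmeans():
        text = ["first ques # on stackoverflow", "please help"]
        vec = CountVectorizer(tokenizer=MyTokenizer(), analyzer='word')
        pos_vector = vec.fit_transform(text).toarray()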

Output: [u' ', u'a', u'e', u'f', u'h', u'i', u'l', u'p', u'q', u'r', u's', u't', u'u']

Expected output: [u'first', u'ques', u'please', u'help']


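The problem lies with your tokenizer: you've split the string into the part you want to keep and the part you don't, but you haven't split the kept part into words. Because the tokenizer returns a string rather than a list of words, CountVectorizer iterates over it one character at a time, which is why the vocabulary consists of single characters. Try using the tokenizer below.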

    class MyTokenizer(object):
        def __call__(self, s):
            if '#' not in s:
                return s.split(' ')
            # strip() avoids an empty token from the space before '#'
            return s.split('#')[0].strip().split(' ')
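
To see why the return type matters, here is a minimal sketch that uses only the standard library. The names `string_tokenizer`, `word_tokenizer` and `vocabulary` are hypothetical, and the `Counter` loop is only a stand-in for the vectorizer's counting step, not scikit-learn's actual implementation. It assumes that each item yielded by the tokenizer's return value is counted as one token, and that the input is lowercased first, as CountVectorizer does by default.

```python
from collections import Counter


def string_tokenizer(s):
    # Mirrors the question's tokenizer: returns a str, not a list of words
    return s.split('#')[0] if '#' in s else s


def word_tokenizer(s):
    # Mirrors a corrected tokenizer: returns a list of words.
    # split('#')[0] is the whole string when there is no '#'.
    return s.split('#')[0].split()


def vocabulary(docs, tokenizer):
    # Stand-in for the counting step (an assumption, not sklearn code):
    # iterate over whatever the tokenizer returns, one token per item.
    counts = Counter()
    for doc in docs:
        counts.update(tokenizer(doc.lower()))
    return sorted(counts)


text = ["first ques # on stackoverflow", "please help"]
print(vocabulary(text, string_tokenizer))
print(vocabulary(text, word_tokenizer))
```

Iterating over a string yields its characters, so the first call gives the single-character vocabulary from the question, while the second gives whole words. The result is sorted, so the words come out in alphabetical order rather than in the order they appear in the input.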