I want to use the hostnames of all the URLs in my dataset as features, but I can't figure out how to make TfidfVectorizer extract only the hostnames from the URLs and calculate their weights.
For instance, I have a dataframe df where the column 'url' has all the URLs I need. I thought I had to do something like:
tfv = TfidfVectorizer(preprocessor=preprocess)
tfv.fit_transform([t for t in df['url']])
You are right. analyzer='word' (the default) creates a tokeniser that uses the default token pattern '(?u)\b\w\w+\b'. If you want to tokenise the entire URL as a single token, you can change the token pattern:
vect = CountVectorizer(token_pattern=r'\S+')
This will tokenise https://www.pythex.org hello hello.there as ['https://www.pythex.org', 'hello', 'hello.there'].
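You can verify this by calling the analyser directly; a quick check, assuming scikit-learn is installed:

from sklearn.feature_extraction.text import CountVectorizer

s = 'https://www.pythex.org hello hello.there'

# Default token pattern: keeps runs of 2+ word characters, so URLs shatter
CountVectorizer().build_analyzer()(s)
# ['https', 'www', 'pythex', 'org', 'hello', 'hello', 'there']

# token_pattern=r'\S+': any run of non-whitespace becomes a single token
CountVectorizer(token_pattern=r'\S+').build_analyzer()(s)
# ['https://www.pythex.org', 'hello', 'hello.there']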
You can then create an analyser to extract the hostname from URLs, as shown in this question. You can either extend CountVectorizer to change its build_analyzer method or just monkey patch it:
def my_analyser():
    # magic is a function that extracts the hostname from a URL,
    # among other things
    return lambda doc: magic(preprocess(vect.decode(doc)))

vect = CountVectorizer(token_pattern=r'\S+')
vect.build_analyzer = my_analyser
vect.fit_transform(df['url'])
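If all you need is the hostname, magic can be a thin wrapper around urllib.parse. This is only a sketch of one possible implementation (feeding df['url'] straight in matches your setup), and the same patch works with TfidfVectorizer, which inherits build_analyzer from CountVectorizer:

from urllib.parse import urlparse
from sklearn.feature_extraction.text import TfidfVectorizer

def magic(url):
    # Sketch of a hostname extractor:
    # 'https://www.pythex.org/page' -> 'www.pythex.org'.
    # The analyser must return a list of tokens, hence the one-element list.
    host = urlparse(url).hostname
    return [host] if host else []

def my_analyser():
    return lambda doc: magic(doc)

tfv = TfidfVectorizer()
tfv.build_analyzer = my_analyser
X = tfv.fit_transform(df['url'])  # rows: URLs, columns: hostnames, values: tf-idf weights
tfv.get_feature_names_out()       # the hostname features

Alternatively, the analyzer parameter itself accepts a callable (TfidfVectorizer(analyzer=magic)), which gets you the same result without the monkey patching.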
Note: tokenisation is not as simple as it appears. The regex I've used has many limitations. For example, it doesn't split the last token of a sentence from the first token of the next sentence if there isn't a space after the full stop. In general, regex tokenisers get very unwieldy very quickly. I recommend looking at
nltk, which offers several different non-regex tokenisers.
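For instance, nltk's standard word tokeniser is a one-liner; a small sketch, assuming nltk is installed and its Punkt model has been downloaded:

import nltk
nltk.download('punkt')  # one-time download of the tokeniser model

from nltk.tokenize import word_tokenize
word_tokenize('Hello there. How are you?')
# ['Hello', 'there', '.', 'How', 'are', 'you', '?']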