HalfPintBoy - 3 months ago
Python Question

Tokenization of text file in python 3.5

I'm trying to tokenize the words in a text file using Python 3.5, but I'm running into a couple of problems. Here is the code:

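import re
a = 0   # running count of word tokens
c = 0   # running count of lines
d = []  # all pieces collected so far
f = open('/Users/Half_Pint_Boy/Desktop/sentenses.txt', 'r')
for line in f:
    b = re.split('[^a-z]', line.lower())
    a += len(filter(None, b))
    c = c + 1
    d = d + b
print(a)
print(c)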

My questions:

  1. The construction
    a+=len(filter(None, b))
    works fine in Python 2.7, but in 3.5 it raises a TypeError:

    object of type 'filter' has no len()

    How can this be solved in Python 3.5?

  2. When I tokenize, my code also counts empty strings as word tokens. How can I remove them?


  1. In Python 3.5, filter returns a lazy iterator rather than a list as in Python 2.7, and an iterator has no length. Convert it to a list explicitly before calling len:

    a += len(list(filter(None, b)))
    #        ^^^^
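  2. The empty strings come from your re.split: every non-letter character is treated as a separate delimiter, so adjacent delimiters leave empty strings between them, e.g.: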

    >>> line  = 'sdksljd sdjsh 1213hjs sjdks'
    >>> b=re.split('[^a-z]', line.lower())
    >>> b
    ['sdksljd', 'sdjsh', '', '', '', '', 'hjs', 'sjdks']
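To make the source of the empty strings concrete, here is a small sketch that counts the delimiters in the same sample line. The variable names `delimiters` and `pieces` are only illustrative and do not appear in the question.

```python
import re

line = 'sdksljd sdjsh 1213hjs sjdks'

# Every single non-letter character is its own delimiter.
delimiters = re.findall('[^a-z]', line.lower())
pieces = re.split('[^a-z]', line.lower())

# With no capture groups in the pattern, n delimiters cut the string into n + 1 pieces.
print(len(delimiters), len(pieces))

# The run ' 1213' holds five adjacent delimiters, which leaves four empty strings.
print(pieces.count(''))
```

The empty strings are therefore not a bug in re.split but the gaps between neighbouring delimiters.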

You can remove them with an if clause in a list comprehension over the results of your re.split, like so:

b = [i for i in re.split('[^a-z]', line.lower()) if i]

The if i condition in the list comprehension is falsy for an empty string, because bool('') is False, so empty strings are dropped.

The same result can also be obtained with filter (which you already used when computing a):

b = list(filter(None, re.split('[^a-z]', line.lower()))) # use the list comprehension if you don't like brackets
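As a rough illustration of why the list call matters in Python 3, the sketch below tries len on the bare filter object and then counts the tokens in two ways. The names `len_failed`, `count_via_list` and `count_via_sum` are made up for this example, and the sum variant is an alternative that is not part of the original answer.

```python
import re

line = 'sdksljd sdjsh 1213hjs sjdks'
pieces = re.split('[^a-z]', line.lower())

# In Python 3, filter() returns a lazy iterator, which has no length.
lazy = filter(None, pieces)
try:
    len(lazy)
    len_failed = False
except TypeError:
    len_failed = True

# Option 1: materialize the iterator as a list, then take its length.
count_via_list = len(list(filter(None, pieces)))

# Option 2: count the items one by one without building a list.
count_via_sum = sum(1 for _ in filter(None, pieces))

print(len_failed, count_via_list, count_via_sum)
```

Option 2 avoids building a throwaway list, which may matter for very long lines, but for typical text the list version is just as good and easier to read.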

Finally, a can be computed after either approach as:

a += len(b)
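Putting the pieces together, the whole loop might look like the sketch below. The function name `count_tokens` is hypothetical, and io.StringIO with made-up sample text stands in for the file from the question so that the example runs anywhere; with a real file you would pass the opened file object instead.

```python
import io
import re

def count_tokens(lines):
    """Return (token_count, line_count, tokens) for an iterable of text lines."""
    token_count = 0
    line_count = 0
    tokens = []
    for line in lines:
        # Split on every non-letter and drop the empty strings.
        words = [w for w in re.split('[^a-z]', line.lower()) if w]
        token_count += len(words)
        line_count += 1
        tokens += words
    return token_count, line_count, tokens

# io.StringIO stands in for the opened file; the sample text is invented.
sample = io.StringIO('Hello, world!\nsdksljd sdjsh 1213hjs sjdks\n')
a, c, d = count_tokens(sample)
print(a)
print(c)
```

As a different technique from the one in the answer, re.findall('[a-z]+', line.lower()) matches the runs of letters directly and so never produces empty strings in the first place.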