yusuf yusuf - 11 months ago 148
Python Question

Enumerating and replacing all tokens in a string file in python

I have a question for you, dear python lovers.

I have a corpus file like the following:

Ah , this is greasy .
I want to eat kimchee .
Is Chae Yoon 's coordinator in here ?
Excuse me , aren 't you Chae Yoon 's coordinator ? Yes . Me ?
-Chae Yoon is done singing .
This lady right next to me ... everyone knows who she is right ?

I want to assign a unique number to each token and replace every token in the file with its assigned number.

By a token I mean, basically, each group of characters in the file separated by ' ' (a space). So, for example, a word is a token, and a standalone punctuation mark is a token as well.

My corpus file contains more than 4 million lines like the ones above. Can you show me the fastest way to do what I want?


Answer Source

Might be overkill but you could write your own classifier:

# Python 3.x
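class Classifier(dict):
    '''Maps each new key to the next unused integer, starting at 1.'''
    def __init__(self, args = None):
        '''args is an iterable of keys (only)'''
        self.n = 1  # next number to hand out
        if args:
            for thing in args:
                self[thing] = self.n
    def __setitem__(self, key, value = None):
        # value is ignored; a new key always gets the next number
        if key not in self:
            super().__setitem__(key, self.n)
            self.n += 1
    def setdefault(self, key, default = None):
        # return the key's number, assigning the next one if the key is new
        increment = key not in self
        n = super().setdefault(key, self.n)
        self.n += int(increment)
        return n
    def update(self, other):
        '''other is an iterable of (key, value) pairs; the values are ignored'''
        for k, v in other:
            self[k] = v
    def transpose(self):
        '''return the reverse mapping, number -> key'''
        return {v:k for k, v in self.items()}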


c = Classifier()
with open('foo.txt') as infile, open('classified.txt', 'w') as outfile:
    for line in infile:
        line = (str(c.setdefault(token)) for token in line.strip().split())
        # end each output line with a newline so the line structure is preserved
        outfile.write(' '.join(line) + '\n')

To reduce the number of writes you could accumulate lines in a list and use writelines() at some set length.
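
A minimal sketch of that batching idea. To keep it self-contained it uses a plain dict instead of Classifier, and the names encode_file and batch_size are made up for this example, so treat it as an illustration rather than part of the answer's code:

```python
def encode_file(in_path, out_path, batch_size=100000):
    """Replace each whitespace-separated token with its number, writing in batches."""
    ids = {}      # token -> number, numbered from 1 in order of first appearance
    buffer = []   # encoded lines waiting to be written
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for line in infile:
            # len(ids) + 1 is only stored when the token has not been seen before
            numbers = (str(ids.setdefault(tok, len(ids) + 1)) for tok in line.split())
            buffer.append(' '.join(numbers) + '\n')
            if len(buffer) >= batch_size:
                outfile.writelines(buffer)
                buffer.clear()
        outfile.writelines(buffer)  # flush whatever is left
    return ids
```

File objects are already buffered by default, so it is worth timing this against the simple loop on your own data before adopting it.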

If you have enough memory, you could read the entire file in, split it, and then feed the resulting tokens to Classifier.
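
A rough sketch of the read-everything-first approach, again with a plain dict standing in for Classifier. The helper names build_ids and encode_text are hypothetical, and the two sample lines are taken from the question:

```python
def build_ids(text):
    """Number each distinct whitespace-separated token from 1, in order of first appearance."""
    ids = {}
    for tok in text.split():
        ids.setdefault(tok, len(ids) + 1)
    return ids

def encode_text(text, ids):
    """Rewrite text line by line, replacing each token with its number."""
    return '\n'.join(' '.join(str(ids[tok]) for tok in line.split())
                     for line in text.splitlines())

text = "Ah , this is greasy .\nI want to eat kimchee ."
ids = build_ids(text)
print(encode_text(text, ids))
# 1 2 3 4 5 6
# 7 8 9 10 11 6
```

With a real file, text would come from a single infile.read() call, which is where the memory requirement comes in.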


To map the numbers back to tokens, transpose the classifier and look each number up:

z = c.transpose()
with open('classified.txt') as f:
    for line in f:
        line = (z[int(n)] for n in line.strip().split())
        print(' '.join(line))

For Python 2.7 super() requires arguments - replace super() with super(Classifier, self).

If you will mainly be working with the token numbers as strings, convert self.n to a string inside the class when saving it. Then you won't have to convert back and forth between strings and ints in your working code.
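
One way this could look, as a sketch only: the class name StrNumberer is made up, and it relies on dict's __missing__ hook rather than the setdefault override used in Classifier above.

```python
class StrNumberer(dict):
    """Hypothetical variant that hands out '1', '2', ... as strings."""
    def __missing__(self, key):
        # called by dict lookup when key is absent: assign the next number as a string
        value = str(len(self) + 1)
        self[key] = value
        return value

c = StrNumberer()
# no str() needed when encoding
encoded = ' '.join(c[tok] for tok in 'Is Chae Yoon here ?'.split())
# reverse mapping is keyed by strings, so no int() needed when decoding
z = {v: k for k, v in c.items()}
decoded = ' '.join(z[n] for n in encoded.split())
```

The trade-off is that the numbers can no longer be compared or sorted numerically without converting them back.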

You also may be able to use LabelEncoder from sklearn.