I am tokenizing a text using nltk.word_tokenize and I would also like to get the index in the original raw text of the first character of every token, i.e.

>>> x = 'hello world'
>>> nltk.word_tokenize(x)
['hello', 'world']
I think what you are looking for is span_tokenize(). Apparently it is not supported by the default tokenizer. Here is a code example with another tokenizer.
from nltk.tokenize import WhitespaceTokenizer
s = "Good muffins cost $3.88\nin New York."
span_generator = WhitespaceTokenizer().span_tokenize(s)
spans = [span for span in span_generator]
print(spans)
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36)]
If you just want the starting offsets:

offsets = [span[0] for span in spans]
print(offsets)
[0, 5, 13, 18, 24, 27, 31]
For further information on the different tokenizers available, see the tokenize API docs.
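If you still want the offsets of the tokens produced by word_tokenize itself, one option is to locate each token in the raw string by scanning forward with str.find. This is a minimal sketch (the helper name token_spans is my own); note that word_tokenize can rewrite some tokens (e.g. quote characters), so tokens that no longer appear verbatim in the raw text are skipped here:

```python
def token_spans(text, tokens):
    """Map each token to its (start, end) character offsets in text,
    scanning left to right so repeated tokens are matched in order."""
    spans = []
    pos = 0
    for tok in tokens:
        start = text.find(tok, pos)
        if start == -1:
            # The tokenizer altered this token (e.g. '"' -> '``'),
            # so it cannot be located verbatim; skip it.
            continue
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

text = 'hello world'
print(token_spans(text, ['hello', 'world']))
# [(0, 5), (6, 11)]
```

NLTK also ships nltk.tokenize.util.align_tokens, which performs a similar alignment but raises an error when a token cannot be matched, so the manual version above is more forgiving for word_tokenize output.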