Arjun - 3 months ago
Python Question

I need a Python regex to tokenize sentences upon finding a "\\n"

I used a document converter to extract the text from a PDF. The text appears in this form:


"Hello Programmers\\nToday we will learn how to create a program in
python\\nThefirst task is very easy and the level will exponentially
increase\\nso please bare in mind that this course is not for the
weak hearted\\n"


I am using NLTK to tokenize the document into sentences at each occurrence of \\n. I have used the regex below, but it doesn't work.

Please excuse me if the regex is wrong. I am new to it and there's no time to learn as I have to deliver the code asap.

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'^[\n]')

>>> tokens
[]


..

#tokenizer = RegexpTokenizer('\\n')

>>> tokens
['\n']
>>>


Even using \\n did not work. Could someone please suggest a correct regex?

Answer

You need to pass gaps=True, so that the pattern is treated as the separator between tokens rather than as the token itself:

>>> tokenizer = RegexpTokenizer(r'\\n', gaps=True)
>>> tokenizer.tokenize(s)
['Hello Programmers', 'Today we will learn how to create a program in python', 'Thefirst task is very easy and the level will exponentially increase', 'so please bare in mind that this course is not for the weak hearted']

A RegexpTokenizer splits a string into substrings using a regular expression. By default the regexp matches the tokens themselves; with gaps=True, it matches the delimiters between tokens instead.
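To see the difference between the two modes without installing NLTK, here is a rough sketch using only the standard re module. It assumes the converter output contains a literal backslash followed by "n" (two characters), not a real newline, which is why the pattern r'\\n' is used; the sample string is a shortened version of the text in the question. As far as I know, this mirrors what RegexpTokenizer does internally (findall by default, split with empty strings dropped when gaps=True), but treat that as an approximation rather than a statement about NLTK's exact implementation.

```python
import re

# Shortened sample from the question. The raw string keeps "\n" as two
# characters (backslash + n), which is the assumption made here.
s = r"Hello Programmers\nToday we will learn how to create a program in python\n"

# The regex r"\\n" matches a literal backslash followed by "n".
pattern = re.compile(r"\\n")

# Pattern matches the tokens (similar to the default, gaps=False):
# only the delimiters themselves come back.
delimiters = pattern.findall(s)

# Pattern matches the gaps (similar to gaps=True): split on the delimiter
# and drop the empty string left after the trailing delimiter.
sentences = [part for part in pattern.split(s) if part]

print(delimiters)
print(sentences)
```

If the text actually contains real newline characters instead, the pattern would be r'\n', or plain s.splitlines() may be enough.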
