Bahramdun Adil - 3 months ago
Java Question

How to write a regex to split a String in this format?

I want to use [,.!?;~] to split a string, but I want each [,.!?;~] delimiter to stay in its place. For example:


This is the example, but it is not enough


To

[This is the example,, but it is not enough] // length=2
[0]=This is the example,
[1]=but it is not enough


As you can see, the comma is still in its place. I did this with the regex (?<=([,.!?;~])+). But if a special word (e.g. but) comes right after the [,.!?;~], I do not want to split at that point. For example:


I want this sentence to be split into this form, but how to do. So if
anyone can help, that will be great


To

[0]=I want this sentence to be split into this form, but how to do.
[1]=So if anyone can help,
[2]=that will be great


As you can see, this part (form, but) is not split in the first sentence.

rD.
Answer

I've used:

  1. Positive Lookbehind (?<=a)b to keep the delimiter.
  2. Negative Lookahead a(?!b) to rule out stop words.
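
To see the two constructs in isolation, here is a minimal sketch. The class name LookaroundDemo, the method names, and the sample string are made up for illustration and are not part of the question.

```java
import java.util.Arrays;

class LookaroundDemo {
    // Positive lookbehind: match the empty position right after a comma.
    // The match is zero-width, so the comma stays at the end of the token.
    static String[] splitKeepingComma(String text) {
        return text.split("(?<=,)");
    }

    // Same split, plus a negative lookahead: skip the position when
    // optional whitespace and then "but" come next.
    static String[] splitUnlessBut(String text) {
        return text.split("(?<=,)(?!\\s*but)");
    }

    public static void main(String[] args) {
        String text = "one, but two, three";

        // tokens: "one,"  " but two,"  " three"
        System.out.println(Arrays.toString(splitKeepingComma(text)));

        // tokens: "one, but two,"  " three"
        System.out.println(Arrays.toString(splitUnlessBut(text)));
    }
}
```

Because both assertions are zero-width, nothing is consumed by the split, which is why the space after each comma ends up at the start of the next token.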

Notice how I've appended the regex (?!\\s*(but|and|if)) after the regex you provided. You can put all the stop words you need to rule out (e.g. but, and, if) inside the parentheses, separated by the pipe symbol.
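
If the stop-word list grows, the alternation can be built from a list instead of typed by hand. The following is only a sketch under a few assumptions of mine: the class StopWordSplitter and its methods are hypothetical names, I add \\b so that a word such as "butter" is not treated as the stop word "but", and I let the split consume the whitespace after the delimiter so the tokens come out without leading spaces. Matching stays case-sensitive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StopWordSplitter {
    // Delimiter class taken from the question.
    private static final String DELIMITERS = "[,.!?;~]";

    // Builds: (?<=[,.!?;~])\s*(?!\s*(?:w1|w2|...)\b)
    static String buildRegex(List<String> stopWords) {
        String keepDelimiter = "(?<=" + DELIMITERS + ")\\s*";
        if (stopWords.isEmpty()) {
            return keepDelimiter; // no stop words: split after every delimiter
        }
        // Pattern.quote makes each word a literal, so regex metacharacters are safe.
        String alternation = stopWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // \b requires the stop word to end at a word boundary.
        return keepDelimiter + "(?!\\s*(?:" + alternation + ")\\b)";
    }

    static String[] split(String text, List<String> stopWords) {
        return text.split(buildRegex(stopWords));
    }

    public static void main(String[] args) {
        String str = "I want this sentence to be split into this form, but how to do. "
                + "So if anyone can help, that will be great";
        for (String token : split(str, Arrays.asList("but", "and", "if"))) {
            System.out.println(token);
        }
    }
}
```

With the sentence from the question this gives the same three tokens as the output shown further down, and "Wait, butter melts" with the single stop word "but" still splits after the comma.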

Also note that the delimiter is still in its place.

Output

Count of tokens = 3
I want this sentence to be split into this form, but how to do.
So if anyone can help,
that will be great

Code

public class HelloWorld {
    public static void main(String[] args) {
        String str = "I want this sentence to be split into this form, but how to do. So if anyone can help, that will be great";

        // (?<=([,.!?;~])+)     positive lookbehind: split right after a delimiter, so the delimiter stays in the token
        // (?!\\s*(but|and|if)) negative lookahead: do not split when a stop word follows
        String[] tokensVal = str.split("(?<=([,.!?;~])+)(?!\\s*(but|and|if))");

        // prints the number of tokens
        System.out.println("Count of tokens = " + tokensVal.length);

        for (String token : tokensVal) {
            // the split is zero-width, so the space after a delimiter stays at the start of the next token; trim it
            System.out.println(token.trim());
        }
    }
}