Nander Speerstra - 1 year ago
Java Question

Multiline RegEx in Java

(My programming question may seem somewhat devious, but I see no other solution.)

A text is written in the Eclipse editor. When a self-made Table view plugin for Eclipse is activated, the text quality is checked automatically by a Python script (not editable by me) that receives the editor text. The editor text is stripped of whitespace characters (\n, \t) other than the ordinary space (' '), because otherwise the sentences cannot be QA checked. When the script is done, it returns the incorrect sentences to the table.

It is possible to click on the sentences in the table, and the plugin will then search the active editor, row by row, for the clicked sentence. This works for single-line sentences. However, multiline sentences cannot be found in the active editor, because all the \n and \t characters are missing from the sentence shown in the table.

To overcome this problem, I changed the script so it takes the complete editor text as one string. I tried the following:

String newSentence = tableSentence.replaceAll(" ", "\\s+");
Pattern p = Pattern.compile(newSentence);
Matcher contentMatcher = p.matcher(editorContent); // editorContent is a String
if (contentMatcher.find()) {
    // Get index offset of string and length of string
}

By changing all spaces into \s+, I hoped to get the match. However, this does not work because it will look like the following:

  • editorContent: The\nright\n\ttasks.

  • tableSentence: The right tasks.

  • newSentence: Thes+rights+tasks. // After the 'replaceAll' call

  • Should be: The\s+right\s+tasks.

So, my question is: how can I adjust the input for the compiler?
I am inexperienced when it comes to Java, so I do not see how to change this. Unfortunately, I cannot change the Python script to also return the full sentences.

Answer

Add a third and fourth backslash to the replacement string, so it looks like this: "\\\\s+".

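Java doesn't have raw (or verbatim) strings, so every backslash in a string literal has to be escaped: the literal "\\\\s+" is the string \\s+ at run time. replaceAll also treats a backslash in its replacement argument as an escape character, so \\ becomes a single literal backslash. Each space is then replaced by \s+ instead of s+, which solves your problem.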
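
To make the two levels of escaping concrete, here is a minimal sketch that compares the original replacement with the fixed one. The class and method names are made up for illustration. The last two methods show alternatives from the standard library (Matcher.quoteReplacement and String.replace) that avoid counting backslashes; they are not part of the original suggestion.

```java
import java.util.regex.Matcher;

public class ReplacementEscapes {

    // Original attempt: the replacement string is \s+ at run time, and
    // replaceAll reads "\s" as an escaped 's', so the backslash is lost.
    static String withTwoBackslashes(String s) {
        return s.replaceAll(" ", "\\s+");
    }

    // Fix from this answer: the replacement string is \\s+ at run time,
    // which replaceAll turns into a literal backslash followed by s+.
    static String withFourBackslashes(String s) {
        return s.replaceAll(" ", "\\\\s+");
    }

    // Alternative: let the library escape the replacement string.
    static String withQuoteReplacement(String s) {
        return s.replaceAll(" ", Matcher.quoteReplacement("\\s+"));
    }

    // Alternative: String.replace substitutes plain text, so the
    // replacement is not scanned for escape characters at all.
    static String withPlainReplace(String s) {
        return s.replace(" ", "\\s+");
    }

    public static void main(String[] args) {
        String tableSentence = "The right tasks.";
        System.out.println(withTwoBackslashes(tableSentence));   // Thes+rights+tasks.
        System.out.println(withFourBackslashes(tableSentence));  // The\s+right\s+tasks.
        System.out.println(withQuoteReplacement(tableSentence)); // The\s+right\s+tasks.
        System.out.println(withPlainReplace(tableSentence));     // The\s+right\s+tasks.
    }
}
```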

When you type a regex in code it goes like this:

\\\\s+
 |     # Compile time
 V
\\s+
 |     # regex parsing
 V
\s+    # actual regex used
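
Putting it together, a sketch of the whole lookup could look like the following. It assumes the editor content is available as one String, as in the question; the class name, the method names and the sample text are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceTolerantSearch {

    // Turns every literal space into the regex \s+ so that the sentence
    // may span line breaks and tabs in the editor.
    static String toFlexiblePattern(String tableSentence) {
        return tableSentence.replaceAll(" ", "\\\\s+");
    }

    // Returns {offset, length} of the first match in editorContent,
    // or null when the sentence is not found.
    static int[] locate(String tableSentence, String editorContent) {
        Pattern p = Pattern.compile(toFlexiblePattern(tableSentence));
        Matcher contentMatcher = p.matcher(editorContent);
        if (contentMatcher.find()) {
            int offset = contentMatcher.start();
            int length = contentMatcher.end() - offset;
            return new int[] { offset, length };
        }
        return null;
    }

    public static void main(String[] args) {
        String editorContent = "Intro.\nThe\nright\n\ttasks.\nOutro.";
        int[] hit = locate("The right tasks.", editorContent);
        System.out.println(hit[0] + " " + hit[1]); // 7 17
    }
}
```

Note that the rest of the sentence is still interpreted as a regex, so a character such as '.' matches any character. If the table sentences can contain regex metacharacters like '(' or '?', quoting each word with Pattern.quote before joining the words with \s+ may be safer.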

Updated my answer according to @nhahtdh's comment (fixed the number of backslashes).