Tony Tony - 2 months ago
Java Question

Using line breaks in String.contains()

I have text like the following:


Grad/Med School University of Osteopathic Medicine and
Health Sci.

This was read from a PDF file into a Java String called pdfFileText. The above is just a small part of the total text.

I will also have a String called institution. In this case the value of institution is "University of Osteopathic Medicine and Health Sci."

In the PDF file, as you see above, the University name exceeded the line width so it wrapped to the next line.

What I want to do is verify pdfFileText.contains(institution). But since the institution is line-wrapped this will not work.

I tried to make a new String, ins = institution.replaceAll(" ", "[ \n\r]+"), but that did not work. I also tried various numbers of backslashes, up to something like institution.replaceAll(" ", "[ \\\\n\\\\r]+"), or maybe more. But nothing seems to work.

What would be the correct regular expression to use? Or does contains() perhaps not accept regular expressions at all? Would you suggest trying a pattern matcher? I would still be confused about what to replace the blank spaces with in a pattern.

Answer

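You're doing it backwards. String.contains() takes a literal character sequence, not a regular expression, so turning institution into a pattern cannot help. Instead, collapse every run of whitespace (including line endings) in the input to a single space first: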

pdfFileText.replaceAll("\\s+", " ").contains(institution)

If you cannot guarantee that institution will always be normalised, then pre-process that as well:

pdfFileText.replaceAll("\\s+", " ")
           .contains(institution.replaceAll("\\s+", " "))
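
For a version that can be compiled and run, the two snippets above could be wrapped roughly as follows. This is a minimal sketch: the class name NormalizedContains and the helper names normalize and containsNormalized are made up for illustration, the sample text uses an assumed "\r\n" line ending, and the trim() call is an addition that is not part of the snippets above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class NormalizedContains {
    // Precompiled so the pattern is not rebuilt on every call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Collapses each run of whitespace (spaces, tabs, line breaks) into a
    // single space. The trim() is an extra step, added so that a stray
    // leading or trailing line break in either string does not matter.
    static String normalize(String text) {
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    // Literal substring test on the normalised forms of both strings.
    static boolean containsNormalized(String haystack, String needle) {
        return normalize(haystack).contains(normalize(needle));
    }

    public static void main(String[] args) {
        // Assumed sample: the wrapped text from the question with a Windows line ending.
        String pdfFileText =
                "Grad/Med School University of Osteopathic Medicine and\r\nHealth Sci.";
        String institution = "University of Osteopathic Medicine and Health Sci.";

        System.out.println(pdfFileText.contains(institution));            // prints false
        System.out.println(containsNormalized(pdfFileText, institution)); // prints true
    }
}
```

If many institution names are checked against the same document, it is probably worth normalising pdfFileText once and reusing the result rather than calling containsNormalized each time.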

If after testing this turns out to be too slow due to the input size, implement your own contains that just skips extra whitespace while matching.
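
One way such a hand-rolled contains might look is sketched below. The class and method names are hypothetical, and it assumes that any run of whitespace in the search string should match any non-empty run of whitespace in the text. It avoids building a normalised copy of the input, but it is a naive scan that is O(n*m) in the worst case, so measure before assuming it is faster.

```java
// Hypothetical sketch; names and matching rules are assumptions, not a library API.
class WhitespaceTolerantContains {

    // Returns true if needle occurs in text when every run of whitespace
    // in needle is allowed to match any non-empty run of whitespace in text.
    static boolean containsIgnoringWhitespaceRuns(String text, String needle) {
        String target = needle.trim();
        if (target.isEmpty()) {
            return true; // mirrors String.contains(""), which is always true
        }
        for (int start = 0; start < text.length(); start++) {
            if (matchesAt(text, start, target)) {
                return true;
            }
        }
        return false;
    }

    // Tries to match target against text beginning at index start.
    private static boolean matchesAt(String text, int start, String target) {
        int i = start; // position in text
        int j = 0;     // position in target
        while (j < target.length()) {
            // Note: Character.isWhitespace is close to, but not identical to, regex \s.
            if (Character.isWhitespace(target.charAt(j))) {
                // The text must have at least one whitespace character here.
                if (i >= text.length() || !Character.isWhitespace(text.charAt(i))) {
                    return false;
                }
                // Skip the whole whitespace run on both sides.
                while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                    i++;
                }
                while (j < target.length() && Character.isWhitespace(target.charAt(j))) {
                    j++;
                }
            } else {
                if (i >= text.length() || text.charAt(i) != target.charAt(j)) {
                    return false;
                }
                i++;
                j++;
            }
        }
        return true;
    }
}
```

With the text from the question, WhitespaceTolerantContains.containsIgnoringWhitespaceRuns(pdfFileText, institution) would be called in place of pdfFileText.contains(institution).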
