Chaklader Chaklader - 1 month ago 5
Java Question

How to split a sentence into words while keeping some compound expressions that contain whitespace?

I need to split a String on whitespace, but I need to keep some compound keywords that contain whitespace intact. For example, I have a String as follows:

String testCase = "The patient is currently being treated for Diabetes with Thiazide diuretics";


I need the String to be split, but I need "Thiazide diuretics" to stay together as a single compound expression after calling:

String[] array = testCase.split(" ");


The result needs to be as follows:


The
patient
is
currently
being
treated
for
Diabetes
with
Thiazide diuretics



How can I do that?

Answer

You need to work with the regex directly in this case and match the tokens instead of splitting on the separators; .split() is not a good fit for this purpose.

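import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "The patient is currently being treated for Diabetes with Thiazide diuretics";

// Try the compound expression first; otherwise match any run of non-whitespace characters.
Matcher m = Pattern.compile("\\b(?:Thiazide diuretics)\\b|\\S+").matcher(s);
List<String> result = new ArrayList<>();
while (m.find()) {
    result.add(m.group());
}
System.out.println(result);
// [The, patient, is, currently, being, treated, for, Diabetes, with, Thiazide diuretics]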

Note: Technically it is possible to do so with .split() using lookarounds:

String s = "Thiazide not-a-keyword diuretics and Thiazide diuretics keyword";

String[] result = s.split("(?<!Thiazide) | (?!diuretics)");
System.out.println(Arrays.toString(result));
// [Thiazide, not-a-keyword, diuretics, and, Thiazide diuretics, keyword]

But this doesn't scale: every additional keyword needs its own pair of lookarounds, and the pattern quickly becomes hard to read and maintain. Try to avoid it.
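
By contrast, the Matcher approach extends to several keywords fairly naturally, because the compound expressions are just alternatives in one group. The following is a minimal sketch, not a definitive implementation: the class name CompoundTokenizer, the helper methods buildPattern and tokenize, and the second keyword "blood pressure" are made up for illustration. It assumes every keyword starts and ends with a word character (so that \b behaves as intended) and that List.of is available (Java 9+).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CompoundTokenizer {

    // Hypothetical helper: builds "\b(?:kw1|kw2|...)\b|\S+" from a list of compound keywords.
    static Pattern buildPattern(List<String> compounds) {
        if (compounds.isEmpty()) {
            // No keywords: plain whitespace-delimited tokens.
            return Pattern.compile("\\S+");
        }
        String alternation = compounds.stream()
                // Longest first, so a longer keyword wins over a shorter one sharing its prefix.
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                // Pattern.quote escapes any regex metacharacters inside a keyword.
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b|\\S+");
    }

    // Hypothetical helper: collects every match (compound keyword or single word) in order.
    static List<String> tokenize(String text, List<String> compounds) {
        List<String> tokens = new ArrayList<>();
        Matcher m = buildPattern(compounds).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "blood pressure" is an invented second keyword, only here to show the list growing.
        List<String> compounds = List.of("Thiazide diuretics", "blood pressure");
        String s = "The patient is currently being treated for Diabetes with Thiazide diuretics";
        System.out.println(tokenize(s, compounds));
        // [The, patient, is, currently, being, treated, for, Diabetes, with, Thiazide diuretics]
    }
}
```

Adding a keyword then means adding one entry to the list rather than rewriting the pattern. If the list is fixed, it is probably worth compiling the pattern once and reusing it instead of rebuilding it on every call.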
