Benvorth - 2 months ago
Java Question

Java regex for optionally enclosed csv-string returns unexpected results

In Java I have a string (taken from a csv-file):

40;"blue-collar";"married";"secondary";"no";1100;"yes";"no";"unknown";29;"may";660


My class CSV_Worker splits it by the given delimiter (;) and removes the quotation marks if necessary:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CSV_Worker {

    Pattern pattern = null;
    int colCount = -1;

    public CSV_Worker (String delimiter, int colCount) {
        // For the delimiter ";" this builds:
        // (?<=^|;)(?:"([^;]*)"|([^;]*))(?=;|$)
        this.pattern = Pattern.compile("(?<=^|\\" + delimiter + ")(?:\"([^\\" + delimiter + "]*)\"|([^\\" + delimiter + "]*))(?=\\" + delimiter + "|$)");
        this.colCount = colCount;
    }

    public String [] split (String line) {

        String [] result = new String[this.colCount];
        Matcher m = pattern.matcher(line);
        int idx = 0;
        // Store one entry per match found in the line
        while (m.find()) {
            result[idx] = m.group();
            idx++;
        }
        return result;
    }
}


Why does CSV_Worker.split(myString) return

40
"blue-collar"
"married"
...


instead of

40
blue-collar
married
...


?

Answer

With m.group() you get the whole match (i.e. group 0), not just the content of one of the capturing groups. The whole match includes the quotes, because they sit inside your non-capturing group but outside the capturing group. Furthermore, you use different capturing groups for the quoted and the unquoted case, so for every match one of the two groups did not participate and returns null. You therefore need to use the Matcher like this:

String g1 = m.group(1);
result[idx] = (g1 == null ? m.group(2) : g1);
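
To see the difference between group 0 and the two capturing groups, here is a minimal sketch that prints all three for every match and then applies the fallback from above. The class name GroupFallbackDemo, the shortened sample line, and the use of a List instead of the fixed-size array are assumptions made for illustration; the pattern itself is built the same way as in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original CSV_Worker.
class GroupFallbackDemo {

    // Builds the two-group pattern from the question for the given delimiter.
    static Pattern buildPattern(String delimiter) {
        String d = "\\" + delimiter; // escaped the same way as in the question
        return Pattern.compile(
            "(?<=^|" + d + ")(?:\"([^" + d + "]*)\"|([^" + d + "]*))(?=" + d + "|$)");
    }

    // Splits a line and strips enclosing quotes by picking whichever group matched.
    static List<String> split(String line, String delimiter) {
        List<String> fields = new ArrayList<>();
        Matcher m = buildPattern(delimiter).matcher(line);
        while (m.find()) {
            String quoted = m.group(1);          // null when the field had no quotes
            fields.add(quoted != null ? quoted : m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "40;\"blue-collar\";\"married\";660"; // shortened sample line
        Matcher m = buildPattern(";").matcher(line);
        while (m.find()) {
            // group() is the whole match; group(1) and group(2) are the two alternatives
            System.out.println(m.group() + " | " + m.group(1) + " | " + m.group(2));
        }
        System.out.println(split(line, ";"));
    }
}
```

For each match exactly one of group(1) and group(2) is non-null, which is why the null check is enough to pick the right one.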

You could also use just a single capturing group by using lookarounds:

Pattern pattern = Pattern.compile("(?<=^|\\" + delimiter + ")\"?((?<!\")[^\\" + delimiter + "]*(?!\")|(?<=\")[^\"]*(?=\"))\"?(?=\\" + delimiter + "|$)");

which allows you to use

result[idx] = m.group(1);

in the split method instead.
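
As a rough check of the single-group variant, the following sketch runs that pattern over the sample line from the question. The class name SingleGroupDemo and the List return type are assumptions made for illustration; only the regex and the m.group(1) call come from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original CSV_Worker.
class SingleGroupDemo {

    // Splits a line using the single-capturing-group pattern with lookarounds.
    static List<String> split(String line, String delimiter) {
        String d = "\\" + delimiter; // escaped the same way as in the question
        Pattern pattern = Pattern.compile(
            "(?<=^|" + d + ")\"?((?<!\")[^" + d + "]*(?!\")|(?<=\")[^\"]*(?=\"))\"?(?=" + d + "|$)");
        List<String> fields = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            fields.add(m.group(1)); // group 1 holds the field without enclosing quotes
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "40;\"blue-collar\";\"married\";\"secondary\";\"no\";1100;"
                    + "\"yes\";\"no\";\"unknown\";29;\"may\";660";
        for (String field : split(line, ";")) {
            System.out.println(field);
        }
    }
}
```

The optional quotes are consumed outside the capturing group, and the lookarounds inside it decide which alternative applies, so no null check is needed.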