Martin Martin - 4 months ago 7
Java Question

How to split or parse a string in Java with escape characters

I have a case where I need to split a string in Java that contains various escape characters. The format will be something like:

id:"description",id:"description",....


id: numeric (int)

description: String escaped with EscapeUtils.escapeJava(input); it can contain any readable characters, including : and , and even ", which will be escaped to \".

So the String.split method doesn't seem appropriate, since it would run into trouble with descriptions that contain , or :. I know I can write an algorithm that will work fine (it is even a nice exercise in Test Driven Development), but I was wondering whether there is a lazier way: some kind of parser that can already do this kind of thing.

My other possible approach is to generate a JSONArray and avoid dealing with complexity I'm not interested in, but it would require one more library dependency, which I'm not convinced about including in this module...

So, what I'm asking for is ideas on how this kind of problem can be solved (libraries, with the Java API, etc.).

Answer

It sounds like your string should match this regex:

^(\d+:"([^"\\]|\\.)*"(,(?!$)|$))+$

in which case you can extract the parts into a Map<Integer, String> by writing something like this:

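import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Assumption: StringEscapeUtils is the Apache Commons Lang 3 class.
import org.apache.commons.lang3.StringEscapeUtils;

// Validates the whole input: one or more id:"description" entries, comma-separated, no trailing comma.
private static final Pattern TOTAL_STRING_PATTERN =
    Pattern.compile("^(\\d+:\"([^\"\\\\]|\\\\.)*\"(,(?!$)|$))+$");
// Captures a single entry: group 1 is the id, group 2 is the still-escaped description.
private static final Pattern PARTIAL_STRING_PATTERN =
    Pattern.compile("(\\d+):\"((?:[^\"\\\\]|\\\\.)*)\"");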

public Map<Integer, String> parse(final String input) {
    if(! TOTAL_STRING_PATTERN.matcher(input).matches()) {
        throw new IllegalArgumentException();
    }
    final Map<Integer, String> ret = new HashMap<Integer, String>();
    final Matcher m = PARTIAL_STRING_PATTERN.matcher(input);
    while(m.find()) {
        final Integer id = Integer.valueOf(m.group(1));
        final String description = StringEscapeUtils.unescapeJava(m.group(2));
        ret.put(id, description);
    }
    return Collections.unmodifiableMap(ret);
}
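
To get a feel for what the two patterns accept and reject, here is a small sketch that uses only java.util.regex. The class name PatternDemo and its two helper methods are made up for illustration; the patterns themselves are copied from the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, only meant to poke at the two patterns.
final class PatternDemo {
    // Same regexes as TOTAL_STRING_PATTERN and PARTIAL_STRING_PATTERN above.
    static final Pattern TOTAL =
        Pattern.compile("^(\\d+:\"([^\"\\\\]|\\\\.)*\"(,(?!$)|$))+$");
    static final Pattern PARTIAL =
        Pattern.compile("(\\d+):\"((?:[^\"\\\\]|\\\\.)*)\"");

    // True if the whole input is a comma-separated list of id:"description" entries.
    static boolean isWellFormed(String input) {
        return TOTAL.matcher(input).matches();
    }

    // Returns the first description exactly as written (still escaped), or null if none is found.
    static String firstRawDescription(String input) {
        Matcher m = PARTIAL.matcher(input);
        return m.find() ? m.group(2) : null;
    }
}
```

With these helpers, isWellFormed accepts 1:"a,b",2:"c:d" (commas and colons inside the quotes are fine) and 1:"a\"b" (escaped quote), but rejects 1:"a", (trailing comma) and 1:"a"b" (unescaped quote inside the description).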

(You may also want to check for the case that the identifier is outside the range of an int, and for the case that the same identifier appears multiple times in the string, and so on. And you may want to make your patterns more flexible in some respect, e.g., allowing whitespace around colons and commas. But the above should be a good start.)
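
As a rough sketch of those extra checks, something like the following could work. It is not the code above but a variation on it: the class name StrictPairParser, the single-pass loop with Matcher.region and lookingAt, and the error messages are all my own choices for illustration. It rejects duplicate identifiers and identifiers that don't fit in an int, and tolerates whitespace around colons and commas. To stay free of dependencies it leaves the descriptions escaped; you would still pass each one through StringEscapeUtils.unescapeJava as in the code above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stricter variant; descriptions are returned still escaped.
final class StrictPairParser {
    // One id:"description" entry, with optional whitespace around it and around the colon.
    private static final Pattern ENTRY =
        Pattern.compile("\\s*(\\d+)\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"\\s*");

    static Map<Integer, String> parse(String input) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(input);
        int pos = 0;
        while (true) {
            // Require an entry to start exactly at the current position.
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Malformed entry at index " + pos);
            }
            int id;
            try {
                id = Integer.parseInt(m.group(1)); // fails if the id is outside the int range
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Identifier out of int range: " + m.group(1), e);
            }
            if (result.putIfAbsent(id, m.group(2)) != null) {
                throw new IllegalArgumentException("Duplicate identifier: " + id);
            }
            pos = m.end();
            if (pos == input.length()) {
                break; // consumed the whole input
            }
            if (input.charAt(pos) != ',') {
                throw new IllegalArgumentException("Expected ',' at index " + pos);
            }
            pos++; // skip the comma; a trailing comma then fails on the next lookingAt()
        }
        return Collections.unmodifiableMap(result);
    }
}
```

Because validation and extraction happen in the same loop, the separate whole-string pattern is no longer needed, and the exception can say where the input went wrong.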