okutane okutane - 24 days ago 9
Java Question

Replacing lots of substrings in big strings

One of our module's performance relies highly on how we replace substrings in string.

We form a "replacement map" which can contain more than 3500 string pairs and then we apply it with StringUtils.replaceEach(text, searchList, replacementList) to big strings (several MBs).

The keys and values are all unique and in most cases have the same character length (but it's not something we can rely on).

Is there a more sophisticated approach to my task than

StringUtils.replaceEach()
? Something which may be an overkill for simple replacements solved by
replaceEach()
, but which is much faster in my "heavy" case.

Answer

While appendReplacement solution proposed by @zeppelin was surprisingly fast on "heaviest piece of data" it turned out to be a nightmare with bigger map.

The best solution so far turned out to be a composition of what we had (StringUtils.replaceEach) and what was proposed:

protected BackReplacer createBackReplacer(Map<ReplacementKey, String> replacementMap) {
        if (replacementMap.isEmpty()) {
            return new BackReplacer() {
                @Override
                public String backReplace(String str) {
                    return str;
                }
            };
        }

        if (replacementMap.size() > MAX_SIZE_FOR_REGEX) {
            final String[] searchStrings = new String[replacementMap.size()];
            final String[] replacementStrings = new String[replacementMap.size()];

            int counter = 0;
            for (Map.Entry<ReplacementKey, String> replacementEntry : replacementMap.entrySet()) {
                searchStrings[counter] = replacementEntry.getValue();
                replacementStrings[counter] = replacementEntry.getKey().getValue();
                counter++;
            }

            return new BackReplacer() {
                @Override
                public String backReplace(String str) {
                    return StringUtils.replaceEach(str, searchStrings, replacementStrings);
                }
            };
        }

        final Map<String, String> replacements = new HashMap<>();
        StringBuilder patternBuilder = new StringBuilder();

        patternBuilder.append('(');
        for (Map.Entry<ReplacementKey, String> entry : replacementMap.entrySet()) {
            replacements.put(entry.getValue(), entry.getKey().getValue());
            patternBuilder.append(entry.getValue()).append('|');
        }

        patternBuilder.setLength(patternBuilder.length() - 1);
        patternBuilder.append(')');

        final Pattern pattern = Pattern.compile(patternBuilder.toString());

        return new BackReplacer() {
            @Override
            public String backReplace(String str) {
                if (str.isEmpty()) {
                    return str;
                }

                StringBuffer sb = new StringBuffer(str.length());

                Matcher matcher = pattern.matcher(str);
                while (matcher.find()) {
                    matcher.appendReplacement(sb, replacements.get(matcher.group(0)));
                }
                matcher.appendTail(sb);

                return sb.toString();
            }
        };
    }

StringUtils algorithm (MAX_SIZE_FOR_REGEX=0):

type=TIMER, name=*.run, count=8127, min=4.239809, max=4235197.925261, mean=645.736554, stddev=47197.97968925558, duration_unit=milliseconds

appendReplace algorithm (MAX_SIZE_FOR_REGEX=1000000):

type=TIMER, name=*.run, count=8155, min=4.374516, max=7806145.439165999, mean=1145.757953, stddev=86668.38562815856, duration_unit=milliseconds

Mixed solution (MAX_SIZE_FOR_REGEX=5000):

type=TIMER, name=*.run, count=8155, min=3.5862789999999998, max=376242.25076799997, mean=389.68986564688714, stddev=11733.9997814448, duration_unit=milliseconds

Our data:

type=HISTOGRAM, name=initialValueLength, count=569549, min=0, max=6352327, mean=6268.940661478599, stddev=198123.040651236, median=12.0, p75=16.0, p95=32.0, p98=854.0, p99=1014.5600000000013, p999=6168541.008000023
type=HISTOGRAM, name=replacementMap.size, count=8155, min=0, max=65008, mean=73.46108949416342, stddev=2027.471388983965, median=4.0, p75=7.0, p95=27.549999999999955, p98=55.41999999999996, p99=210.10000000000036, p999=63138.68900000023

This change halved time spend in StringUtils.replaceEach in former solution and gave us 25% performance boost in our module which was mostly IO-bound.