scriptin - 5 months ago
Java Question

Preserve XML layout (attribute order, newlines) using StAX to make small changes (e.g. change an attribute)

I'm trying to replace the values of some attributes in an SVG file using the StAX iterator API. I read the original file with XMLEventReader, check and modify elements, and then write them to XMLEventWriter.

My original file has the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<!--
...
-->
<!DOCTYPE ...
...
]>
<svg ...


The output I get is not the same:

<?xml version="1.0"?><!--
...
--><!DOCTYPE ...
...
]><svg ...


As you can see, encoding is gone, as well as the newlines around the comment and the doctype.

Also, the order of attributes on every tag in the resulting file seems to be random. I've read another question and I'm aware that attribute order is not guaranteed, but this doesn't help me.

These SVG files are under Git, so I'd like to preserve their plain-text layout as much as possible.

How do I fix those issues? With my current task, I could just replace attribute values as plain text, without any parsing, but I would like to have a solution which would allow me to take tag nesting and things like that into account.

If it can't be done with StAX, I'm totally open to different approaches. I've already tried the DOM approach, and it's even worse. Maybe there are some third-party parsers...

Answer

When all you need is to update attributes, the best option is not to use XMLEventWriter at all, but to find the positions (character offsets) of the tags in the XML file and make substring replacements. You can do it like this:

  1. Using XMLEventReader, iterate through the file.
  2. When you encounter an element whose attributes you want to change, call XMLEvent#getLocation() and then getCharacterOffset() on the result; this returns the position in the original file at which the event was emitted.
  3. By tracking the offsets of the previous and current events, you can extract from the contents of the original file a substring containing just that one element.
  4. Update the substring and join it with the text before and after it, which gives you the updated XML as a string. Since the extracted substring contains only one element, you can safely assume that all attributes in it are unique, so you can add, remove, and update them as you want without accidentally touching other elements.
  5. Write the updated contents to a file as a string.

Downside: You have to parse attributes manually, but this is trivial in most cases.
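
The steps above could be sketched roughly as follows. This is only an illustration under assumptions, not a definitive implementation: the names AttributePatcher, patch, and replaceAttribute are made up for this example, and it assumes the StAX implementation bundled with the JDK, where the offset reported for a StartElement appears to point just past the closing > of the start tag (other implementations may report the start of the tag instead). Rather than tracking the previous event's offset, the sketch searches backwards for <, which works because a literal < cannot appear inside an attribute value in well-formed XML.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of any library.
final class AttributePatcher {

    /** Replaces the value of one attribute inside a single start-tag substring. */
    static String replaceAttribute(String tag, String attr, String newValue) {
        // Matches: whitespace, name, '=', then a value in single or double quotes.
        Pattern p = Pattern.compile(
                "(\\s" + Pattern.quote(attr) + "\\s*=\\s*)([\"'])(.*?)\\2", Pattern.DOTALL);
        Matcher m = p.matcher(tag);
        if (!m.find()) {
            return tag; // attribute not present: leave the tag untouched
        }
        // newValue is inserted verbatim, so it must already be XML-escaped.
        return tag.substring(0, m.start()) + m.group(1) + m.group(2) + newValue
                + m.group(2) + tag.substring(m.end());
    }

    /** Rewrites attr on every element with the given local name, copying all other text as-is. */
    static String patch(String xml, String element, String attr, String newValue)
            throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringBuilder out = new StringBuilder();
        int copied = 0; // everything before this index is already in 'out'
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (!event.isStartElement()
                    || !event.asStartElement().getName().getLocalPart().equals(element)) {
                continue;
            }
            // ASSUMPTION: the offset points just past the '>' of this start tag.
            int end = event.getLocation().getCharacterOffset();
            int start = xml.lastIndexOf('<', end - 1);
            if (start < copied || xml.charAt(end - 1) != '>') {
                throw new IllegalStateException("Unexpected offset semantics: " + end);
            }
            out.append(xml, copied, start)
               .append(replaceAttribute(xml.substring(start, end), attr, newValue));
            copied = end;
        }
        reader.close();
        return out.append(xml, copied, xml.length()).toString();
    }
}
```

Calling AttributePatcher.patch(svgText, "rect", "width", "42") would change only the width attribute of rect elements and leave every other character, including attribute order and newlines, as it was. The regular expression is deliberately simple and could be fooled by an attribute value that itself contains text looking like another attribute.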


Also, I found an issue with Characters events: they are reported only after the subsequent < or </ has already been consumed. For example, in <foo>bar</foo> the bar characters are reported as bar</.

This may differ in other StAX implementations; I'm using the default one from the Java library. I assume the behavior comes from the fact that a StAX parser never goes backwards: by the time it has enough information to detect the end of a Characters event, it has already consumed the beginning of the next element (an opening or closing tag).
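
To check how a particular StAX implementation reports offsets, a small probe like the one below may help. It is a sketch only: OffsetDump and dump are hypothetical names, and the slices it prints depend entirely on the parser in use, so no expected output is shown. It prints, for each event, the event type, the reported character offset, and the piece of the source between the previous and the current offset.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.Location;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical diagnostic helper, not part of any library.
final class OffsetDump {

    /** Returns one line per event: type code, reported offset, and source slice. */
    static List<String> dump(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<String> lines = new ArrayList<>();
        int previous = 0;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            Location location = event.getLocation();
            int offset = (location == null) ? -1 : location.getCharacterOffset();
            // Clamp so that unknown (-1) or out-of-range offsets cannot break substring().
            int current = Math.max(previous, Math.min(offset, xml.length()));
            lines.add("type=" + event.getEventType()
                    + " offset=" + offset
                    + " slice=[" + xml.substring(previous, current) + "]");
            previous = current;
        }
        reader.close();
        return lines;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Type codes are the XMLStreamConstants values (e.g. 1 = START_ELEMENT, 4 = CHARACTERS).
        dump("<foo>bar</foo>").forEach(System.out::println);
    }
}
```

Running it against <foo>bar</foo> shows whether the slice for the Characters event includes the following </, as described above for the default implementation.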


As for my original attempts to use XMLEventWriter:

  • The missing encoding in the XML header can be added by explicitly constructing a new StartDocument event.
  • Missing newlines can be added manually, but I couldn't find any flag to preserve them. This seems to be related to the issue above: the parser reports the offsets of those elements together with adjacent newline characters (sometimes prepended, sometimes appended).
  • The random order of attributes can only be fixed with a custom parser, as noted by @vtd-xml-author.
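
For the first two points, a minimal sketch of the XMLEventWriter route might look like this. HeaderPreservingCopy and copy are hypothetical names, and the example assumes the default JDK implementation writing to a plain Writer. It swaps the reader's StartDocument event for one built with XMLEventFactory#createStartDocument(encoding, version) and adds a newline after it by hand; attribute order is not addressed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of any library.
final class HeaderPreservingCopy {

    /** Copies all events, but writes an XML declaration that names the given encoding. */
    static String copy(String xml, String encoding) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartDocument()) {
                // Replace the parsed StartDocument with an explicit one carrying the encoding.
                writer.add(events.createStartDocument(encoding, "1.0"));
                // Newline after the declaration, added manually.
                writer.add(events.createCharacters("\n"));
            } else {
                writer.add(event);
            }
        }
        writer.close(); // flushes buffered output into the StringWriter
        reader.close();
        return out.toString();
    }
}
```

When writing to a file instead of a StringWriter, create the underlying writer with the same charset that the declaration names, so the declared and actual encodings agree.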