I'm trying to parse a UTF-16 encoded document using the Apache Camel Splitter with xtokenize, which delegates to Woodstox (com.ctc.wstx.sr.BasicStreamReader). I cannot know the encoding of a file before I read it; currently some files are UTF-16 and others are UTF-8:
.split().xtokenize(getToken(), 'w', NAMESPACES)
String charset = IOHelper.getCharsetName(exchange);
com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character '�' (code 65533 / 0xfffd) in prolog; expected '<'
I created a Camel JIRA ticket ( https://issues.apache.org/jira/browse/CAMEL-11846 ). As my comments there show, there is no easy way to split UTF-16 XML with Camel without knowing in advance that it is UTF-16.
Subclassing XMLTokenExpressionIterator (which is an ExpressionAdapter) and switching it to an InputStream works at first, but there are several other places (XSLT, XPath, and conversion to StaxSource) where it breaks for the same reason.
As a workaround, I find it easier to let XmlStreamReader detect the encoding in advance (this happens when it is initialized) and then set the result as the Exchange.CHARSET_NAME header or property.
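
To illustrate the detection step, here is a minimal sketch using only the JDK. It is not what XmlStreamReader does internally: XmlEncodingSniffer is a hypothetical helper that only looks at the byte order mark, or at the byte pattern of a leading "<?", and falls back to UTF-8. It ignores UTF-32 and the encoding attribute of the XML declaration, which a real detector such as XmlStreamReader also considers.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of Camel, Woodstox or Commons IO.
final class XmlEncodingSniffer {

    private XmlEncodingSniffer() {
    }

    // Guesses the charset name from the first (up to four) bytes of an XML document.
    static String detect(byte[] head) {
        int b0 = at(head, 0);
        int b1 = at(head, 1);
        int b2 = at(head, 2);
        int b3 = at(head, 3);
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            return "UTF-8"; // UTF-8 byte order mark
        }
        if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
            // UTF-16 byte order mark: Java's "UTF-16" decoder uses it to pick the byte order.
            return "UTF-16";
        }
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) {
            return "UTF-16BE"; // "<?" encoded as big-endian UTF-16, no BOM
        }
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) {
            return "UTF-16LE"; // "<?" encoded as little-endian UTF-16, no BOM
        }
        return "UTF-8"; // assumption: anything else is treated as UTF-8
    }

    // Peeks at the first four bytes of a stream and rewinds it afterwards.
    static String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(4);
        byte[] head = new byte[4];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream before four bytes were available
            }
            n += r;
        }
        in.reset();
        byte[] seen = new byte[n];
        System.arraycopy(head, 0, seen, 0, n);
        return detect(seen);
    }

    // Returns the unsigned byte at index i, or -1 if the array is shorter than that.
    private static int at(byte[] bytes, int i) {
        return i < bytes.length ? bytes[i] & 0xFF : -1;
    }
}
```

In a route, the detected name would then be stored before the split, for example with exchange.setProperty(Exchange.CHARSET_NAME, charset) in a processor, so that IOHelper.getCharsetName(exchange) picks it up. The stream has to support mark/reset (e.g. a BufferedInputStream) so the body can still be read from the start afterwards.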