Tom Tom - 6 months ago 25
Node.js Question

POSTing XML with Chinese characters to the Microsoft Translator API raises deserializing exception

I'm trying to translate from Chinese (Simplified) to English using the Microsoft Translator API.

A couple of requirements


  • I must use the HTTP method POST, and not GET with a query string, because my queries exceed Microsoft's URI limit of 15,845 characters. This can happen even when I stay under the 10,000-character limit, as is the case with Chinese characters: the query string has to be URL encoded, which dramatically increases its length, but Microsoft decodes it before determining the character count.

  • The only translate HTTP method that allows POSTs is the TranslateArrayMethod; the TranslateMethod, for example, only allows GETs. Unfortunately, the TranslateArrayMethod only accepts an XML document, so I must work with XML.
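
To illustrate the length inflation described in the first requirement, here is a minimal sketch. The helper name `encodedGrowth` and the sample text are made up for illustration; only the standard `encodeURIComponent` function is assumed.

```javascript
// Hypothetical helper (not part of any API): compares the character count
// of a text with the length of its URL-encoded form.
function encodedGrowth(text) {
  const encoded = encodeURIComponent(text);
  return { chars: text.length, encodedChars: encoded.length };
}

// 2,000 Chinese characters, well under the 10,000-character limit.
const sample = '南'.repeat(2000);

// Each '南' becomes '%E5%8D%97' (9 characters) once URL encoded.
console.log(encodedGrowth(sample)); // { chars: 2000, encodedChars: 18000 }
```

So a text of only 2,000 Chinese characters already produces an encoded query string longer than the 15,845-character URI limit.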



The following is an example of an XML document that I am sending:

<TranslateArrayRequest>
<AppId/>
<From>es</From>
<Options>
<ContentType xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">text/plain</ContentType>
</Options>
<Texts>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
<![CDATA[Hola]]>
</string>
</Texts>
<To>en</To>
</TranslateArrayRequest>


This works fine, the result is:

<ArrayOfTranslateArrayResponse xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<TranslateArrayResponse>
<From>es</From>
<OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
<a:int>4</a:int>
</OriginalTextSentenceLengths>
<TranslatedText>Hello</TranslatedText>
<TranslatedTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
<a:int>5</a:int>
</TranslatedTextSentenceLengths>
</TranslateArrayResponse>
</ArrayOfTranslateArrayResponse>


However, if I then add any Chinese character, like so:

<TranslateArrayRequest>
<AppId/>
<From>zh-CHS</From>
<Options>
<ContentType xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">text/plain</ContentType>
</Options>
<Texts>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
<![CDATA[南]]>
</string>
</Texts>
<To>en</To>
</TranslateArrayRequest>


I get a weird response:

<html>
<body/>
<h1>System.Runtime.Serialization.SerializationException</h1>
<p>Message: There was an error deserializing the object of type Microsoft.MT.MDistributor.V2.TranslateArrayRequest. Unexpected end of file. Following elements are not closed: TranslateArrayRequest. Line 1, position 298.</p>
</html>


Note that I also tried not using CDATA escaping, but it doesn't help. Changing the From language has no effect either.

I'm working with Node.js (JavaScript), although since this is a generic HTTP API I don't think that should matter.

Answer

OK, I encountered exactly the same problem calling one of the Microsoft Translator POST APIs from Node.js. The API works fine - returns the translation as expected - as long as there are no non-ASCII characters, but when I add a single accented 'é' character to the appropriate <string> section of the POST body, it responds with an error:

<html><body/><h1>System.Runtime.Serialization.SerializationException</h1>
<p>Message: There was an error deserializing the object of type Microsoft.MT.MDistributor.V2.TranslateArrayRequest. Unexpected end of file. Following elements are not closed: TranslateArrayRequest. Line 1, position 782.</p>
</html>

I figured out that the problem is that the Content-Length header wants the length in bytes, but I had been sending the length in characters. Why does this happen? Well, the typical way to measure the length of the body for the Node http request is to call

var length = body.length

and get the "length" - i.e. number of characters - of the string. This works when all of the characters are ASCII. However, in UTF-8 every non-ASCII character (including my accented 'é') takes more than one byte. So when the body contains non-ASCII characters, the byte length is greater than the character length, and the character length is the wrong value for Content-Length. In this case, it caused the Microsoft server to stop reading the message prematurely, generating the error message.

Instead we need to measure the length in bytes with the call (in Node.js)

var length = Buffer.byteLength(body, 'utf8')

and send that length in the Content-Length header, and the Microsoft Translator API works again.
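
To make the difference concrete, here is a minimal sketch of building the headers. The helper name `buildPostHeaders`, the Content-Type value, and the sample body fragment are assumptions for illustration, not something the API documentation prescribes; only `Buffer.byteLength` comes from the fix above.

```javascript
// Hypothetical helper: builds headers for a POST whose body is a string.
// The Content-Type value is an assumption; use whatever your request needs.
function buildPostHeaders(body) {
  return {
    'Content-Type': 'text/xml; charset=utf-8',
    // Length in bytes of the UTF-8 encoded body, not in characters.
    'Content-Length': Buffer.byteLength(body, 'utf8'),
  };
}

// Illustrative fragment only, not a complete TranslateArrayRequest.
const body = '<string><![CDATA[南]]></string>';

console.log(body.length);                               // 30 (characters)
console.log(buildPostHeaders(body)['Content-Length']);  // 32 (bytes)
```

The two-byte gap comes from '南', which takes three bytes in UTF-8. The resulting object can be passed as the `headers` option of Node's `http.request`, with the body then written as UTF-8, for example via `req.end(body, 'utf8')`.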