Biggie Biggie - 6 months ago 36
Node.js Question

convert streamed buffers to utf8-string

I want to make a HTTP-request using node.js to load some text from a webserver. Since the response can contain much text (some Megabytes) I want to process each text chunk separately. I can achieve this using the following code:

var req = http.request(reqOptions, function(res) {
...
res.setEncoding('utf8');
res.on('data', function(textChunk) {
// process utf8 text chunk
});
});


This seems to work without problems. However I want to support HTTP-compression, so I use zlib:

var zip = zlib.createUnzip();

// NO res.setEncoding('utf8') here since we need the raw bytes for zlib
res.on('data', function(chunk) {
// do something like checking the number of bytes downloaded
zip.write(chunk); // give the raw bytes to zlib, s.b.
});

zip.on('data', function(chunk) {
// convert chunk to utf8 text:
var textChunk = chunk.toString('utf8');

// process utf8 text chunk
});


This can be a problem for multi-byte characters like
'\u00c4'
which consists of two bytes:
0xC3
and
0x84
. If the first byte is covered by the first chunk (
Buffer
) and the second byte by the second chunk then
chunk.toString('utf8')
will produce incorrect characters at the end/beginning of the text chunk. How can I avoid this?

Hint: I still need the buffer (more specifically the number of bytes in the buffer) to limit the number of downloaded bytes. So using
res.setEncoding('utf8')
like in the first example code above for non-compressed data does not suit my needs.

Answer

Streamed Buffers

If you have streamed buffers like in the question above where the first byte of a multi-byte UTF8-character may be contained in the first Buffer (chunk) and the second byte in the second Buffer then you should use a StringDecoder. :

var StringDecoder = require('string_decoder').StringDecoder;

var req = http.request(reqOptions, function(res) {
    ...
    var decoder = new StringDecoder('utf8');

    res.on('data', function(chunk) {
        var textChunk = decoder.write(chunk);
        // process utf8 text chunk
    });
});

This way bytes of incomplete characters are buffered by the StringDecoder until all required bytes were written to the decoder.

Single Buffer

If you have a single Buffer you can use its toString method that will convert all or part of the binary contents to a string using a specific encoding. It defaults to utf8 if you don't provide a parameter, but I've explicitly set the encoding in this example.

var req = http.request(reqOptions, function(res) {
    ...

    res.on('data', function(chunk) {
        var textChunk = chunk.toString('utf8');
        // process utf8 text chunk
    });
});

Another way

As @Zugwalt mentioned:

You can also to chunk.toString('utf8');