DutchKev - 2 months ago
Node.js Question

Loop over binary Float64Array file - NodeJS

I have 100 CSV files, each with about 50,000,000 rows of 3 cells.

Each row needs to trigger an event to do some calculations.
With the npm read-line lib, which reads the CSV through a pipe, I could get about 1,000,000 cycles of that process per second (on a single Node thread).

But this process takes a lot of steps just to get some numbers:


  1. Open .csv file stream

  2. Stringify each chunk

  3. Search for new line \n in chunk

  4. Split that line into an array (3 cells)

  5. parseFloat every cell
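
For reference, steps 2 to 5 can be sketched roughly as below. This is only an illustration of the work done per row, not the code of the read-line lib; `createCsvParser` and its `push`/`end` methods are hypothetical names, and the sketch assumes comma-separated ASCII numbers with exactly 3 cells per row.

```javascript
// Hypothetical parser mirroring steps 2-5: stringify, find "\n", split, parseFloat.
function createCsvParser(onRow) {
  let remainder = ''; // text after the last newline of the previous chunk

  function emitLine(line) {
    if (line.trim() === '') return; // skip empty lines
    const cells = line.split(','); // step 4: split into 3 cells
    // step 5: parseFloat every cell (a trailing "\r" is ignored by parseFloat)
    onRow(parseFloat(cells[0]), parseFloat(cells[1]), parseFloat(cells[2]));
  }

  return {
    push(chunk) {
      // step 2: stringify the chunk (assumes ASCII content, so no split characters)
      const text = remainder + chunk.toString('utf8');
      let start = 0;
      let newline;
      // step 3: search for each newline in the chunk
      while ((newline = text.indexOf('\n', start)) !== -1) {
        emitLine(text.slice(start, newline));
        start = newline + 1;
      }
      remainder = text.slice(start);
    },
    end() {
      // flush a last line that has no trailing newline
      if (remainder !== '') emitLine(remainder);
      remainder = '';
    }
  };
}
```

With a file stream, `push` would be called from the `'data'` handler and `end` from the `'end'` handler.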



So to parse them even faster, I thought converting the CSV file to a binary file could help. So I created a binary Float64Array buffer file, as all values in the cells are floating-point numbers.

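    // Assumes `rows` is an array of 3-number arrays and `writeStream` is an open fs.WriteStream.
    const CELLS_PER_ROW = 3;
    const buffer = Buffer.alloc(rows.length * CELLS_PER_ROW * Float64Array.BYTES_PER_ELEMENT);
    let counter = 0;
    rows.forEach(function (row) {
        row.forEach(function (cell) {
            // Each double takes 8 bytes and is stored little-endian.
            buffer.writeDoubleLE(cell, counter++ * Float64Array.BYTES_PER_ELEMENT);
        });
    });
    writeStream.write(buffer);
    writeStream.end();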


Then it only has to do these steps:


  1. Open the binary file stream

  2. Convert every stream chunk (one chunk = 3 cells) to a Float64Array via its underlying ArrayBuffer

    fs.createReadStream(fileName, {highWaterMark: 24})
        //.pause()
        .on('data', chunk => {
            //this._stream.pause();

            this._bufferOffset = 0;

            this.emit('tick', new Float64Array(chunk.buffer, chunk.byteOffset, chunk.byteLength / Float64Array.BYTES_PER_ELEMENT));
        })
        .on('close', () => {
            let nextFile = this._getNextBINFilePath();

            if (!nextFile) {
                return this.emit('end');
            }

            this._initTestStream();
        })



All good so far. I can read the binary file and parse its contents row by row in a Float64Array.

But for some reason it seems slower than reading a CSV (text) file, splitting it by line, splitting each line by comma, and calling parseFloat on the cells.

Am I not seeing the bigger picture of binary, buffers and TypedArrays?

Thanks

Answer

I think the bottleneck is creating a new Float64Array for each (small) chunk.

You could pass the 3 float64 values as separate parameters instead, or work directly on the chunk.
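
A minimal sketch of working directly on the chunk, assuming the file was written with `writeDoubleLE` as in the question. `readRow` and its callback signature are hypothetical names used only for illustration; the three values are passed as plain numbers, so no typed array is allocated per row.

```javascript
// Hypothetical helper: decode one 24-byte row straight from a Buffer
// and hand the 3 values to the callback as plain numbers.
function readRow(chunk, byteOffset, onRow) {
  const a = chunk.readDoubleLE(byteOffset);
  const b = chunk.readDoubleLE(byteOffset + 8);
  const c = chunk.readDoubleLE(byteOffset + 16);
  onRow(a, b, c);
}

// Usage with a hand-built 24-byte row.
const demo = Buffer.alloc(24);
demo.writeDoubleLE(1.5, 0);
demo.writeDoubleLE(2.5, 8);
demo.writeDoubleLE(3.5, 16);

const seen = [];
readRow(demo, 0, (a, b, c) => seen.push(a, b, c));
```

In the stream from the question, the `'data'` handler would call `readRow(chunk, 0, ...)` instead of constructing a Float64Array.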

Or create a Float64Array over a much larger chunk, and call the function repeatedly with that same Float64Array.
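
A rough sketch of the larger-chunk idea follows. It swaps the stream for synchronous `fs.readSync` calls to keep the example short, which blocks the event loop while reading. `forEachRow`, `ROWS_PER_READ` and the `(doubles, index)` callback signature are assumptions made for illustration, and the Float64Array view only decodes correctly on a little-endian platform, matching the `writeDoubleLE` calls in the question.

```javascript
const fs = require('fs');

const ROW_BYTES = 3 * Float64Array.BYTES_PER_ELEMENT; // 24 bytes per row
const ROWS_PER_READ = 4096; // assumed batch size, tune as needed

// Hypothetical reader: calls onRow(doubles, index) for every row, where
// doubles[index], doubles[index + 1], doubles[index + 2] are the 3 cells.
function forEachRow(fileName, onRow) {
  // One ArrayBuffer allocated once; the Buffer and the Float64Array
  // are two views over the same memory, so nothing is copied per row.
  const arrayBuffer = new ArrayBuffer(ROW_BYTES * ROWS_PER_READ);
  const bytes = Buffer.from(arrayBuffer);
  const doubles = new Float64Array(arrayBuffer);

  const fd = fs.openSync(fileName, 'r');
  try {
    let position = 0;
    let bytesRead;
    while ((bytesRead = fs.readSync(fd, bytes, 0, bytes.length, position)) > 0) {
      const rows = Math.floor(bytesRead / ROW_BYTES);
      if (rows === 0) break; // trailing partial row, nothing more to decode
      for (let i = 0; i < rows; i++) {
        onRow(doubles, i * 3);
      }
      // Advance by whole rows only, so a partial row is re-read next time.
      position += rows * ROW_BYTES;
    }
  } finally {
    fs.closeSync(fd);
  }
}
```

The callback must consume the values before returning, because the same Float64Array is overwritten by the next read.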
