aesthaddicts - 3 months ago
Node.js Question

NodeJS stream exceeds heap

I am trying to massage some data from a ~400 MB CSV file and save it into a database for local querying. The data is the freely available IP2Location LITE database, and the database I'm importing it into is the embedded nedb.

require('dotenv').load()

const fs = require('fs')
const csv = require('csv-parse')
const es = require('event-stream')
const Datastore = require('nedb')
const BatchStream = require('batch-stream')

const db = new Datastore({ filename: process.env.DB_PATH, autoload: true })
const debug = require('debug')('setup')

function massage ([ipLo, ipHi, cc, country, area, city, lat, lng]) {
  return { ipLo, ipHi, cc, country, area, city, lat, lng }
}

function setup () {
  let qty = 0

  return new Promise((resolve, reject) => {
    fs.createReadStream(process.env.IP2LOCATION_PATH)
      // read and parse csv
      .pipe(csv())
      // batch it up
      .pipe(new BatchStream({ size: 100 }))
      // write it into the database
      .pipe(es.map((batch, cb) => {
        // massage and persist it
        db.insert(batch.map(massage), (err, docs) => {
          qty += batch.length
          if (qty % 100 === 0)
            debug(`Inserted ${qty} documents…`)
          cb(err, docs)
        })
      }))
      .on('end', resolve)
      .on('error', reject)
  })
}

module.exports = setup

if (!module.parent) {
  debug('Setting up geo database…')
  setup()
    .then(_ => debug('done!'))
    .catch(err => debug('there was an error :/', err))
}


After about 75000 entries I get the following error:

<--- Last few GCs --->

80091 ms: Mark-sweep 1372.0 (1435.0) -> 1371.7 (1435.0) MB, 1174.6 / 0 ms (+ 1.4 ms in 1 steps since start of marking, biggest step 1.4 ms) [allocation failure] [GC in old space requested].
81108 ms: Mark-sweep 1371.7 (1435.0) -> 1371.6 (1435.0) MB, 1017.2 / 0 ms [last resort gc].
82158 ms: Mark-sweep 1371.6 (1435.0) -> 1371.6 (1435.0) MB, 1049.9 / 0 ms [last resort gc].


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x4e36fec9e31 <JS Object>
1: substr [native string.js:~320] [pc=0xdab4e7f1185] (this=0x35500e175a29 <Very long string[65537]>,Q=50,am=65487)
2: __write [/Users/arnold/Develop/mount-meru/node_modules/csv-parse/lib/index.js:304] [pc=0xdab4e7b8f98] (this=0x350ff4f97991 <JS Object>,chars=0x35500e175a29 <Very long string[65537]>,end=0x4e36fe04299 <false>,callback=0x4e36fe04189 <undefined>)
3: arguments adaptor fra...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
1: node::Abort() [/usr/local/Cellar/node/6.3.1/bin/node]
2: node::FatalException(v8::Isolate*, v8::Local<v8::Value>, v8::Local<v8::Message>) [/usr/local/Cellar/node/6.3.1/bin/node]
3: v8::Utils::ReportApiFailure(char const*, char const*) [/usr/local/Cellar/node/6.3.1/bin/node]
4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/usr/local/Cellar/node/6.3.1/bin/node]
5: v8::internal::Factory::NewByteArray(int, v8::internal::PretenureFlag) [/usr/local/Cellar/node/6.3.1/bin/node]
6: v8::internal::TranslationBuffer::CreateByteArray(v8::internal::Factory*) [/usr/local/Cellar/node/6.3.1/bin/node]
7: v8::internal::LCodeGenBase::PopulateDeoptimizationData(v8::internal::Handle<v8::internal::Code>) [/usr/local/Cellar/node/6.3.1/bin/node]
8: v8::internal::LChunk::Codegen() [/usr/local/Cellar/node/6.3.1/bin/node]
9: v8::internal::OptimizedCompileJob::GenerateCode() [/usr/local/Cellar/node/6.3.1/bin/node]
10: v8::internal::Compiler::GetConcurrentlyOptimizedCode(v8::internal::OptimizedCompileJob*) [/usr/local/Cellar/node/6.3.1/bin/node]
11: v8::internal::OptimizingCompileDispatcher::InstallOptimizedFunctions() [/usr/local/Cellar/node/6.3.1/bin/node]
12: v8::internal::StackGuard::HandleInterrupts() [/usr/local/Cellar/node/6.3.1/bin/node]
13: v8::internal::Runtime_StackGuard(int, v8::internal::Object**, v8::internal::Isolate*) [/usr/local/Cellar/node/6.3.1/bin/node]
14: 0xdab4e60961b
15: 0xdab4e7f1185
16: 0xdab4e7b8f98
[1] 18102 abort npm run setup


What exactly is happening? Isn't the whole point of the Stream API that you don't have to hold all of the data in memory at once, but can process it piece by piece? It looks like the error is coming directly from the csv-parse library; is that correct?
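
To narrow it down, a minimal check (a sketch, assuming the same IP2LOCATION_PATH environment variable as above) is to run the parser on its own, without the database, and watch the heap; if memory stays flat, csv-parse is probably not the part that is leaking:

// Sketch, not part of the original setup script: parse the CSV without the
// database and log heap usage to see whether csv-parse alone grows memory.
require('dotenv').load()

const fs = require('fs')
const csv = require('csv-parse')

const parser = fs.createReadStream(process.env.IP2LOCATION_PATH).pipe(csv())

let rows = 0
parser.on('data', () => {
  rows++
  if (rows % 10000 === 0) {
    const heapMb = process.memoryUsage().heapUsed / 1024 / 1024
    console.log(`${rows} rows parsed, heap ~${heapMb.toFixed(1)} MB`)
  }
})
parser.on('end', () => console.log('parsing finished'))
parser.on('error', err => console.error(err))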

Answer

After some debugging I found out that the memory leak was in a third-party library I used (specifically nedb). I suppose it was also not meant for storing that many documents, so I decided to replace it.
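
The following is purely an illustration of one possible replacement, since the answer does not say which store was actually used: writing the same batches into SQLite through the better-sqlite3 module (the module choice and the geo table layout are assumptions). Because SQLite keeps the data on disk rather than in the JavaScript heap, each batch can be inserted synchronously inside a transaction and then garbage-collected:

// Hypothetical replacement for the nedb-based setup(); better-sqlite3 and the
// geo table schema are assumptions for illustration only.
require('dotenv').load()

const fs = require('fs')
const csv = require('csv-parse')
const es = require('event-stream')
const BatchStream = require('batch-stream')
const Database = require('better-sqlite3')

const db = new Database(process.env.DB_PATH)
db.exec(`CREATE TABLE IF NOT EXISTS geo (
  ipLo TEXT, ipHi TEXT, cc TEXT, country TEXT,
  area TEXT, city TEXT, lat REAL, lng REAL
)`)

const insert = db.prepare(
  'INSERT INTO geo VALUES (@ipLo, @ipHi, @cc, @country, @area, @city, @lat, @lng)'
)
// Wrapping each batch in a transaction keeps inserts fast; the rows are
// written to disk and do not accumulate on the heap.
const insertBatch = db.transaction(rows => rows.forEach(row => insert.run(row)))

function massage ([ipLo, ipHi, cc, country, area, city, lat, lng]) {
  return { ipLo, ipHi, cc, country, area, city, lat, lng }
}

function setup () {
  return new Promise((resolve, reject) => {
    fs.createReadStream(process.env.IP2LOCATION_PATH)
      .pipe(csv())
      .pipe(new BatchStream({ size: 100 }))
      .pipe(es.map((batch, cb) => {
        insertBatch(batch.map(massage))
        cb()
      }))
      .on('end', resolve)
      .on('error', reject)
  })
}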

Some articles I found useful while chasing down this problem:
