I'm developing a service in NodeJS which will create text files from images using a node wrapper for the tesseract OCR engine. I want it to be a constantly running service, being started and restarted (on crash) by upstart.
I have the option of making the servers (virtual machines on which this is going to run) multi-core machines with large RAM and disk space, or I have the option of creating 4 or 5 small VMs with one core each, 1 GB RAM and relatively small disk size.
With the first approach, I would have to fork various child processes to make use of all cores, which adds complexity to the code. On the other hand, I just have one VM to worry about.
With the second approach, I don't have to worry about forking child processes, but I would have to create and configure multiple VMs.
Are there other pros and cons of each approach that I haven't thought of?
I'd avoid partitioning VMs since that means you'll likely end up wasting RAM and CPU -- it's not unlikely that you'll find one VM using 100% of its resources while another sits idle. There's also non-trivial overhead involved in running 5 operating systems instead of one.
Why are you considering forking many processes? If you use the right library, this will be unnecessary.
Many of the tesseract libraries on npm are poorly written. They are ultra-simplistic bindings to the tesseract C++ code. In JS, you call the addon's recognize() function, which just calls tesseract's Recognize(), which does CPU-intensive work in a blocking fashion. This means you're doing the recognition on the main V8 thread, which we know is a no-no: while it runs, the event loop can't service anything else. I assume this is why you're considering forking processes, since each would only be able to do a single blocking OCR operation at once.
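To see why blocking the main thread hurts, here's a minimal sketch. The busy loop is only a stand-in for a synchronous recognize() call (it is not any real addon's API), and the names blockingRecognize and measureTimerDelay are made up for illustration.

```javascript
const { performance } = require('perf_hooks');

// Hypothetical stand-in for a blocking OCR call: spins the CPU for blockMs.
function blockingRecognize(blockMs) {
  const start = performance.now();
  while (performance.now() - start < blockMs) {
    // Busy-wait; nothing else can run on this thread in the meantime.
  }
}

// Schedules a 0 ms timer, then blocks the thread.
// Resolves with how long the timer actually took to fire (in ms).
function measureTimerDelay(blockMs) {
  return new Promise((resolve) => {
    const scheduledAt = performance.now();
    setTimeout(() => resolve(performance.now() - scheduledAt), 0);
    blockingRecognize(blockMs);
  });
}
```

Calling measureTimerDelay(200) should report a delay of at least 200 ms for a timer that asked for 0 ms. In a real service, that delayed timer would be an incoming request waiting for the OCR call to finish.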
Instead, you want a library that does the OCR work on a separate thread. tesseract_native is an example. It is properly designed: it uses libuv to call into tesseract on a worker thread.
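libuv maintains a worker thread pool, so you can run several OCR operations concurrently, all in a single process. Note that the pool has four threads by default; set the UV_THREADPOOL_SIZE environment variable to match your core count if you have more cores than that.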
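Here's a rough sketch of what that looks like from the JS side. I'm not reproducing tesseract_native's API here; instead, crypto.pbkdf2 stands in for the OCR call, because it is a built-in that also does its CPU-heavy work on the libuv thread pool. The names fakeOcr and recognizeAll are hypothetical.

```javascript
const crypto = require('crypto');
const { promisify } = require('util');

const pbkdf2 = promisify(crypto.pbkdf2);

// Hypothetical stand-in for an async OCR call. The async form of pbkdf2
// runs on a libuv worker thread, so the main thread stays free.
function fakeOcr(imageName) {
  return pbkdf2(imageName, 'salt', 50000, 32, 'sha256').then((key) => ({
    imageName,
    digest: key.toString('hex'),
  }));
}

// Starts all jobs at once. libuv runs as many in parallel as the pool
// has threads and queues the rest.
function recognizeAll(imageNames) {
  return Promise.all(imageNames.map(fakeOcr));
}
```

To size the pool, set the variable when launching the process, for example UV_THREADPOOL_SIZE=8 node service.js (service.js being a placeholder name). Setting it from inside the script is not guaranteed to take effect.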