I want to run a Neural Network on mobile. Currently, I am exploring Mxnet (http://mxnet.io) framework for deploying it (only for Inference). As I am concerned about the execution time performance on mobile, I want to know if it runs on the GPU of mobile phones (Android/iOS). The documentation mentions that it can use multiple CPUs as well as GPUs for training, but it is still not clear if it can GPU of mobile phone for inference on mobile. It mentions about dependency on BLAS, because of which it seems it uses CPU on mobile. Could anyone please tell me if I can use mobile GPU with mxnet for inference? If not, what are my other options?
I went through an exercise of creating a neural net application for Android using Deeplearning4j. Because Deeplearning4j is based on Java, I thought it would be a good match for Android. Based on my experience with this, I can answer some of your questions.
To answer your most basic question:
Could anyone please tell me if I can use mobile GPU with mxnet for inference?
The answer is: No. The explanation for this follows.
It mentions about dependency on BLAS, because of which it seems it uses CPU on mobile.
BLAS (Basic Linear Algebraic Subprograms) is at the heart of AI computation. Because of the sheer amount of number-crunching involved in these complex models the math routines must be optimized as much as possible. The computational firepower of GPUs make them ideal processors for AI models.
It appears that MXNet can use Atlas (libblas), OpenBLAS, and MKL. These are CPU-based libraries.
Currently the main (and — as far as I know — only) option for running BLAS on a GPU is CuBLAS, developed specifically for NVIDIA (CUDA) GPUs. Apparently MXNet can use CuBLAS in addition to the CPU libraries.
The GPU in many mobile devices is a lower-power chip that works with ARM architectures which doesn't have a dedicated BLAS library yet.
what are my other options?
Just go with the CPU. Since it's the training that's extremely compute-intensive, using the CPU for inference isn't the show-stopper you think it is. In OpenBLAS, the routines are written in assembly and hand-optimized for each CPU it can run on. This includes ARM.
Do the recognition on the server. After working on another demo application which sent an image to the server that performed the recognition and returned results to the device, I think this approach has some benefits for the user such as better overall response time and not having to find space to install a 100MB(!) application.
Since you also tagged iOS, using a C++-based framework like MXNet is probably the best choice if you are trying to go cross-platform. But their method of creating one giant .cpp file and providing a simple native interface might not give you enough flexibility for Android. Deeplearning4j has solved that problem pretty elegantly by using JavaCPP which takes the complexity out of JNI.