Adam Adam - 4 months ago 265
Linux Question

Fail to run tensorflow on GPU

I fail to run the TF-CUDA tutorials_example_trainer as given in the installation guide (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md#installing-from-sources)

I've had problems with the CUDA libs before, but that was with graphics related demo's.

All details below,
Thank you in advance for the help provided.

Environment info



Operating System: Debian Stretch

Installed version of CUDA and cuDNN:
8.0, 5.0

If installed from source, provide


  1. 554ddd9ad2d4abad5a9a31f2d245f0b1012f0d10

  2. Build label: 0.3.0
    Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
    Build time: Fri Jun 10 11:38:23 2016 (1465558703)



Steps to reproduce




  1. Build from source with 367.35 driver

  2. Run bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu



Logs or other output that would be helpful



bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
modprobe: ERROR: ../libkmod/libkmod-module.c:832 kmod_module_insert_module() could not find module by name='nvidia_367_uvm'
modprobe: ERROR: could not insert 'nvidia_367_uvm': Unknown symbol in module, or unknown parameter (see dmesg)
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: debian
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: debian
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.35.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.35 Mon Jul 11 23:14:21 PDT 2016
GCC version: gcc version 5.4.0 20160609 (Debian 5.4.0-6)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.35.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 367.35.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
F tensorflow/cc/tutorials/example_trainer.cc:125] Check failed: ::tensorflow::Status::OK() == (session->Run({{"x", x}}, {"y:0", "y_normalized:0"}, {}, &outputs)) (OK vs. Invalid argument: Cannot assign a device to node 'y': Could not satisfy explicit device specification '/gpu:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
[[Node: y = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/gpu:0"](Const, x)]])

Answer

The error message indicates that your GPU driver is not well set. You could try the following command to see if the driver is installed correctly.

$ nvidia-dmi

If not please follow the instruction on the CUDA official site and reinstall CUDA. As your OS is not officially supported, you may want to change your OS.

Comments