Adam Adam - 1 year ago 813
Linux Question

Failing to run TensorFlow on GPU

I cannot run the TF-CUDA tutorials_example_trainer as described in the installation guide (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md#installing-from-sources).

I've had problems with the CUDA libs before, but that was with graphics-related demos.

All details are below.
Thank you in advance for your help.

Environment info



Operating System: Debian Stretch

Installed version of CUDA and cuDNN:
8.0, 5.0

If installed from source, provide


  1. Commit hash: 554ddd9ad2d4abad5a9a31f2d245f0b1012f0d10

  2. Bazel version:
    Build label: 0.3.0
    Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
    Build time: Fri Jun 10 11:38:23 2016 (1465558703)



Steps to reproduce




  1. Build from source with 367.35 driver

  2. Run bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu



Logs or other output that would be helpful



bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
modprobe: ERROR: ../libkmod/libkmod-module.c:832 kmod_module_insert_module() could not find module by name='nvidia_367_uvm'
modprobe: ERROR: could not insert 'nvidia_367_uvm': Unknown symbol in module, or unknown parameter (see dmesg)
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: debian
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: debian
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.35.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.35 Mon Jul 11 23:14:21 PDT 2016
GCC version: gcc version 5.4.0 20160609 (Debian 5.4.0-6)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.35.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 367.35.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
F tensorflow/cc/tutorials/example_trainer.cc:125] Check failed: ::tensorflow::Status::OK() == (session->Run({{"x", x}}, {"y:0", "y_normalized:0"}, {}, &outputs)) (OK vs. Invalid argument: Cannot assign a device to node 'y': Could not satisfy explicit device specification '/gpu:0' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0
[[Node: y = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/gpu:0"](Const, x)]])

Answer Source

The error message indicates that your GPU driver is not set up correctly. You can run the following command to check whether the driver is installed properly:

$ nvidia-smi

If it is not, follow the instructions on the official CUDA site and reinstall CUDA. Because your OS is not officially supported, you may want to switch to a supported one.
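
The log in the question shows modprobe failing to insert nvidia_367_uvm just before cuInit returns CUDA_ERROR_UNKNOWN, so it may also help to check which NVIDIA kernel modules are actually loaded. Below is a minimal sketch, assuming a Linux system that exposes /proc/modules. The function names list_nvidia_modules and report_uvm are made up for illustration, and the exact module name (nvidia_uvm, nvidia_367_uvm, or another variant) depends on how the driver was packaged.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any NVIDIA or TensorFlow tooling.

# Print the names of loaded kernel modules that start with "nvidia".
# The optional argument is a /proc/modules-style file; it defaults to
# /proc/modules and exists mainly so a saved copy can be inspected.
list_nvidia_modules() {
    modules_file="${1:-/proc/modules}"
    # The first field of each line in /proc/modules is the module name.
    awk '$1 ~ /^nvidia/ { print $1 }' "$modules_file"
}

# Report whether any loaded NVIDIA module has "uvm" in its name.
# This only matches on the name; it does not prove the module works.
report_uvm() {
    modules_file="${1:-/proc/modules}"
    if list_nvidia_modules "$modules_file" | grep -q 'uvm'; then
        echo "uvm module loaded"
    else
        echo "uvm module missing"
    fi
}

# Example usage on the affected machine:
#   list_nvidia_modules
#   report_uvm
```

If report_uvm prints "uvm module missing", the modprobe error in the log suggests looking at dmesg for the reason the module could not be inserted.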
