
Trouble compiling and running a custom TensorFlow GPU op

I am a newbie at C++, but I managed to work out an original CPU op for TensorFlow. Now I would like to write an op for the GPU. I have a little experience with OpenCL. I am following the guide here:

https://www.tensorflow.org/versions/r0.11/how_tos/adding_an_op/index.html#gpu-support

Below is my C++ code, followed by the CUDA file. The code doesn't actually do anything yet: it compiles correctly, but every time I try to run it I get a core dump. For debugging purposes I have removed all the contents of my class so that I can focus on the problem. The error output says essentially this:

*** Error in `/usr/bin/python': free(): invalid next size (fast): 0x00007fef04033ba0 ***


Here is the d_grid_gpu.cc file:

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"

REGISTER_OP("DGridGpu")
.Input("grid: int32")
.Attr("start_x: int = 0")
.Attr("start_y: int = 0")
.Attr("stop_x: int = 28")
.Attr("stop_y: int = 28")
.Attr("size_x: int = 28")
.Attr("size_y: int = 28")
.Attr("wall_height: float = 2.5")
.Output("prev: int32");

using namespace tensorflow;

void run();

class DGridGpuOp : public OpKernel {
public:
explicit DGridGpuOp(OpKernelConstruction* context) : OpKernel(context) {

}

void Compute(OpKernelContext* context) override {
run();
}

};

REGISTER_KERNEL_BUILDER(Name("DGridGpu").Device(DEVICE_GPU), DGridGpuOp);


Here is the d_grid_gpu.cu.cc file:

#if GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"

#include <stdio.h>
#define SIZE 1024

__global__ void VectorAdd(int *a, int *b, int *c, int n)
{
  int i = threadIdx.x;

  if (i < n)
    c[i] = a[i] + b[i];
}

void run() {
  int *a, *b, *c;
  int *d_a, *d_b, *d_c;

  a = (int *)malloc(SIZE * sizeof(int));
  b = (int *)malloc(SIZE * sizeof(int));
  c = (int *)malloc(SIZE * sizeof(int));

  cudaMalloc(&d_a, SIZE * sizeof(int));
  cudaMalloc(&d_b, SIZE * sizeof(int));
  cudaMalloc(&d_c, SIZE * sizeof(int));

  for (int i = 0; i < SIZE; ++i) {
    a[i] = i;
    b[i] = i;
    c[i] = 0;
  }

  cudaMemcpy(d_a, a, SIZE * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, b, SIZE * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_c, c, SIZE * sizeof(int), cudaMemcpyHostToDevice);

  // blocks, threads
  VectorAdd<<<1, SIZE>>>(d_a, d_b, d_c, SIZE);

  cudaMemcpy(c, d_c, SIZE * sizeof(int), cudaMemcpyDeviceToHost);

  for (int i = 0; i < 10; ++i)
    printf("output : c[%d] = %d\n", i, c[i]);

  free(a);
  free(b);
  free(c);

  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_c);
}

#endif


Here are the commands I use to build the op:

TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')

nvcc -std=c++11 -c -o d_grid_gpu.cu.o d_grid_gpu.cu.cc \
-I $TF_INC -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC --expt-relaxed-constexpr

g++ -std=c++11 -shared -o d_grid_gpu.so d_grid_gpu.cc \
d_grid_gpu.cu.o -I $TF_INC -fPIC -lcudart -D_GLIBCXX_USE_CXX11_ABI=0 -L /usr/lib/x86_64-linux-gnu/


That's all I have. As I said, the CUDA code does nothing useful, but the whole op compiles. I have Python code that calls this library, which I have not included. I do believe my CUDA installation is working. I am using Ubuntu 16.10 and CUDA 8.

Edit: here is some of the error output before the core dump:

*** Error in `/usr/bin/python': free(): invalid next size (fast): 0x00007f34f4033ba0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x790cb)[0x7f35664f20cb]
/lib/x86_64-linux-gnu/libc.so.6(+0x8275a)[0x7f35664fb75a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f35664ff18c]
/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so(+0x22223a1)[0x7f354d7953a1]
/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so(+0x222b6a2)[0x7f354d79e6a2]
/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so(+0x221fd90)[0x7f354d792d90]
/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so(_ZN5Eigen26NonBlockingThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x3c8)[0x7f354d9f4ce8]
/usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x22)[0x7f354d9f44b2]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb8f0)[0x7f354b0408f0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x770a)[0x7f356684770a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x5f)[0x7f35665810af]
======= Memory map: ========
200000000-200100000 rw-s 3cf997000 00:06 570 /dev/nvidiactl
... more memory map here...


I hope this is helpful. At one point I thought it was working, but I cannot reproduce that result.

Edit: I have changed my code a little, but I still get a core dump.

d_grid_gpu.cc

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"

REGISTER_OP("DGridGpu")
.Input("grid: int32")
.Output("prev: int32");

using namespace tensorflow;

void run(const int * in, int * out);

class DGridGpuOp : public OpKernel {
public:
explicit DGridGpuOp(OpKernelConstruction* context) : OpKernel(context) {


}

void Compute(OpKernelContext* context) override {


Tensor* prev_h = NULL;

const Tensor& grid_h = context->input(0);

auto grid = grid_h.flat<int32>();

OP_REQUIRES_OK(context, context->allocate_output(
0,
TensorShape({64}), &prev_h));

auto prev = prev_h->flat<int32>();

run(grid.data(), prev.data()); // do something to grid_host and move it to prev_host

//exit
}

};

REGISTER_KERNEL_BUILDER(Name("DGridGpu").Device(DEVICE_GPU), DGridGpuOp);
//REGISTER_KERNEL_BUILDER(Name("DGridGpu").Device(DEVICE_CPU), DGridGpuOp);


d_grid_gpu.cu.cc

#if GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"

#include <stdio.h>
#define SIZE 20

__global__ void VectorAdd(const int *in, int *out, int n)
{
  int i = threadIdx.x;

  if (i < n)
    out[i] = in[i] + out[i];
}

void run(const int * in, int * out) {
  VectorAdd<<<1, SIZE>>>(in, out, SIZE);
}

#endif

Answer

Briefly, the larger issue is that you are attempting to manage memory yourself, but TensorFlow already knows how to do this for you. You should use TensorFlow's mechanisms for managing memory; you do not need any of the malloc, free, cudaMalloc, cudaFree, or cudaMemcpy code.

I would start by modifying the GPU kernel from the tutorial:

https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/g3doc/how_tos/adding_an_op/cuda_op_kernel.cc
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/g3doc/how_tos/adding_an_op/cuda_op_kernel.cu.cc

The op kernel receives its inputs as buffers that are already allocated in GPU memory. You just need to pass their addresses to the CUDA kernel.
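
As a sketch, the whole .cu.cc file can then shrink to a kernel plus a thin launcher that only ever sees device addresses. This mirrors the AddOne kernel from the tutorial linked above; the extra n parameter on run() (instead of a hard-coded SIZE) is my own choice, not something your code requires:

#if GOOGLE_CUDA
#define EIGEN_USE_GPU

// Both pointers are assumed to already be device addresses.
__global__ void AddOneKernel(const int* in, int* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = in[i] + 1;
}

// Called from Compute(); note there is no cudaMalloc/cudaMemcpy/free anywhere.
void run(const int* in, int* out, int n) {
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  AddOneKernel<<<blocks, threads>>>(in, out, n);
}

#endif  // GOOGLE_CUDA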

To allocate a buffer for your output, use OpKernelContext::allocate_output() to allocate a Tensor, and pass its address to your GPU kernel. There is also an allocate_temp() for allocating temporary buffers. The example above allocates its outputs this way. By default, on the GPU, this allocates a buffer in GPU memory, so there is no need to allocate memory yourself or copy things from device to host.
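
For example, the host-side Compute() can stay close to your second version. Here is a minimal sketch that shapes the output like the input instead of hard-coding TensorShape({64}) and passes the element count to the launcher sketched above:

  void Compute(OpKernelContext* context) override {
    // For a DEVICE_GPU kernel, the input tensor is already resident in GPU memory.
    const Tensor& grid_h = context->input(0);
    auto grid = grid_h.flat<int32>();

    // TensorFlow allocates the output buffer, in GPU memory, for us.
    Tensor* prev_h = NULL;
    OP_REQUIRES_OK(context,
                   context->allocate_output(0, grid_h.shape(), &prev_h));
    auto prev = prev_h->flat<int32>();

    // Hand the device addresses straight to the CUDA launcher.
    run(grid.data(), prev.data(), static_cast<int>(grid.size()));
  }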

You are currently populating a buffer on the host that is fed as input to your kernel, and then copying it to the GPU manually. It might be easiest either to populate that buffer on the GPU, or to use a separate TensorFlow CPU operator to create the input; TensorFlow takes care of host -> device copies when necessary.
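
If you do want the op itself to generate data on the device rather than receiving it as an input tensor, one option is a small initialization kernel that fills a buffer obtained from allocate_temp() or allocate_output() in place. This is a hypothetical sketch of that idea, not code from the tutorial:

// Fills buf[i] = i on the device, like the host-side loop in the original run().
__global__ void FillWithIndex(int* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    buf[i] = i;
}

void fill_with_index(int* buf, int n) {
  const int threads = 256;
  FillWithIndex<<<(n + threads - 1) / threads, threads>>>(buf, n);
}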

I hope this helps!