Eric Leibenguth - 2 months ago

Theano - Shared variable as input of function for large dataset

I am new to Theano... My apologies if this is obvious.

I am trying to train a CNN, based on the LeNet tutorial. A major difference from the tutorial is that my dataset is too large to fit in memory, so I have to load each batch during training.

The original model has this:

train_model = theano.function(
    [index], cost, updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]})

...which does not work for me, as it assumes that the whole training set (train_set_x and train_set_y) is entirely loaded in memory.

So I switched to this:

train_model = theano.function([x,y], cost, updates=updates)

And tried to call it with:

data, target = load_data(minibatch_index) # load_data returns typical numpy.ndarrays for a given minibatch

data_shared = theano.shared(np.asarray(data, dtype=theano.config.floatX), borrow=True)
target_shared = T.cast(theano.shared(np.asarray(target, dtype=theano.config.floatX), borrow=True), 'int32')

cost_ij = train_model(data_shared, target_shared)

But got:

TypeError: ('Bad input argument to theano function with name ":103" at index 0(0-based)', 'Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?')

So I guess I can't use a shared variable as an input to a Theano function. But then, how should I proceed...?


All inputs to compiled Theano functions (i.e. the outputs of calls to theano.function(...)) should always be concrete values, typically scalars or numpy arrays. Shared variables are a way to wrap a numpy array so that it can be treated like a symbolic variable, but that is not necessary when the data is being passed in as a function input.

So you should be able to just omit wrapping your data and target values as shared variables and do the following instead:

cost_ij = train_model(data, target)
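
Putting it together, a minimal training loop might look like the sketch below. This is only an illustration: train_model and load_data are your objects from above, while n_epochs and n_train_batches are assumed names for values you would define elsewhere.

import numpy as np
import theano

# train_model and load_data are defined as above;
# n_epochs and n_train_batches are assumed to be set elsewhere.
for epoch in range(n_epochs):
    for minibatch_index in range(n_train_batches):
        # load_data returns plain numpy arrays for this minibatch
        data, target = load_data(minibatch_index)
        # pass the arrays directly; no shared variables needed
        cost_ij = train_model(
            np.asarray(data, dtype=theano.config.floatX),
            np.asarray(target, dtype='int32'))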

Note that, if you're using a GPU, your data will reside in the computer's main memory, and each minibatch you pass as an input has to be copied to GPU memory separately, which adds overhead and slows training down. Also note that you still have to split the data up and pass only one part at a time; this approach won't let you run GPU computations over the whole dataset at once if it doesn't fit in GPU memory.
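
If those per-minibatch copies become a bottleneck, one common compromise (sketched below, and going beyond the answer above) is to keep the tutorial's givens-based function, but back it with shared variables that hold only one chunk of the dataset at a time and are refreshed with set_value(). Here x, y, cost, updates, index and batch_size are assumed to be defined as in the LeNet tutorial; chunk_x, chunk_y, chunk_size, n_features, load_chunk, n_chunks and n_batches_per_chunk are hypothetical names used only for this sketch.

import numpy as np
import theano

# Reusable buffers for one chunk of the dataset.
# Integer labels stay in host memory, which is fine.
chunk_x = theano.shared(np.zeros((chunk_size, n_features),
                                 dtype=theano.config.floatX), borrow=True)
chunk_y = theano.shared(np.zeros(chunk_size, dtype='int32'), borrow=True)

train_model = theano.function(
    [index], cost, updates=updates,
    givens={x: chunk_x[index * batch_size: (index + 1) * batch_size],
            y: chunk_y[index * batch_size: (index + 1) * batch_size]})

for chunk_index in range(n_chunks):
    # load_chunk returns a piece of the dataset small enough for GPU memory
    data, target = load_chunk(chunk_index)
    chunk_x.set_value(np.asarray(data, dtype=theano.config.floatX), borrow=True)
    chunk_y.set_value(np.asarray(target, dtype='int32'), borrow=True)
    for minibatch_index in range(n_batches_per_chunk):
        cost_ij = train_model(minibatch_index)

With this layout only the set_value() calls copy data to the device, so each chunk is transferred once rather than once per minibatch.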