qed qed - 18 days ago 8
Python Question

How to use mini-batch instead of SGD

Here is a quick implementation of a one-layer neural network in Python:

import numpy as np

# simulate data
np.random.seed(94106)
X = np.random.random((200, 3)) # 200 3d vectors
# first col is set to 1
X[:, 0] = 1
def simu_out(x):
    return np.sum(np.power(x, 2))
y = np.apply_along_axis(simu_out, 1, X)
# code 1 if above average
y = (y > np.mean(y)).astype("float64")*2 - 1
# split into training and testing sets
Xtr = X[:100]
Xte = X[100:]
ytr = y[:100]
yte = y[100:]
# initial weights
w = np.random.random(3)

# 1-layer network. Final layer has one node
def epoch():
    err_sum = 0
    global w
    for i in range(len(ytr)):
        learn_rate = .1
        s_l1 = Xtr[i].T.dot(w) # signal at layer 1, pre-activation
        x_l1 = np.tanh(s_l1) # output at layer 1, activation
        err = x_l1 - ytr[i]
        err_sum += err
        # see here: https://youtu.be/Ih5Mr93E-2c?t=51m8s
        delta_l1 = 2 * err * (1 - x_l1**2)
        dw = Xtr[i] * delta_l1
        w -= learn_rate * dw
    print("Mean error: %f" % (err_sum / len(ytr)))
epoch()
for i in range(1000):
    epoch()

def predict(X):
    global w
    return np.sign(np.tanh(X.dot(w)))

# > 80% accuracy!!
np.mean(predict(Xte) == yte)


It uses stochastic gradient descent for optimization. How would I apply mini-batch gradient descent here?

Answer

The difference between "classical" SGD and mini-batch gradient descent is that you use multiple samples (a so-called mini-batch) to calculate each update for w. The advantage is that the steps you take toward the solution are less noisy, because you follow a smoothed gradient.
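
To make the "less noisy" point concrete, here is a minimal sketch that compares how widely single-sample gradients scatter with how widely mini-batch averages scatter. The data, the seed, the batch size and the helper name per_sample_grads are made up for illustration; only the tanh unit and the squared-error gradient follow the code in the question.

```python
import numpy as np

# Toy data (an assumption, not the data from the question)
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = np.sign(rng.random(100) - 0.5)   # random +1 / -1 targets
w = rng.random(3)

def per_sample_grads(X, y, w):
    """Gradient of the squared error for every sample, one row each."""
    out = np.tanh(X @ w)
    delta = 2 * (out - y) * (1 - out**2)
    return X * delta[:, None]

grads = per_sample_grads(X, y, w)                 # shape (100, 3)

# Average the gradients within consecutive batches of 10 samples
batch_size = 10
batch_grads = grads.reshape(-1, batch_size, 3).mean(axis=1)   # shape (10, 3)

# Total variance across steps: single samples vs. mini-batch averages
sgd_spread = grads.var(axis=0).sum()
batch_spread = batch_grads.var(axis=0).sum()
print("per-sample spread: %f" % sgd_spread)
print("mini-batch spread: %f" % batch_spread)
```

The averaged gradients scatter less around the full-data gradient than the single-sample ones, which is what "smoothed" means here.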

To do that, you need an inner loop that iterates over the mini-batch and accumulates the update dw. For example (quick-n-dirty code):

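from math import ceil

batch_size = 10  # assumed value; pick whatever suits your data

def epoch():
    err_sum = 0
    learn_rate = 0.1
    global w
    for i in range(ceil(len(ytr) / batch_size)):
        # slice out the i-th mini-batch (the last one may be shorter)
        batch = Xtr[i*batch_size:(i+1)*batch_size]
        target = ytr[i*batch_size:(i+1)*batch_size]
        dw = np.zeros_like(w)
        for j in range(len(batch)):
            s_l1 = batch[j].T.dot(w)
            x_l1 = np.tanh(s_l1)
            err = x_l1 - target[j]
            err_sum += err
            delta_l1 = 2 * err * (1 - x_l1**2)
            dw += batch[j] * delta_l1
        # one update per mini-batch, using the averaged gradient
        w -= learn_rate * (dw / len(batch))
    print("Mean error: %f" % (err_sum / len(ytr)))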

This gave an accuracy of 87 percent in a test.
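
The inner loop over the batch can also be written as a couple of NumPy array operations. The following is only a sketch: batch_gradient is a hypothetical helper name and the toy batch is made up. It checks the vectorized result against an explicit loop of the kind used in the mini-batch code.

```python
import numpy as np

def batch_gradient(batch, target, w):
    """Averaged squared-error gradient for one mini-batch of a tanh unit."""
    x_l1 = np.tanh(batch @ w)              # activations, one per sample
    err = x_l1 - target
    delta_l1 = 2 * err * (1 - x_l1**2)
    return batch.T @ delta_l1 / len(batch)

# Toy mini-batch (an assumption, not the data from the question)
rng = np.random.default_rng(0)
batch = rng.random((5, 3))
target = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
w = rng.random(3)

# Reference: accumulate the gradient sample by sample
dw_loop = np.zeros_like(w)
for j in range(len(batch)):
    out = np.tanh(batch[j].dot(w))
    dw_loop += batch[j] * 2 * (out - target[j]) * (1 - out**2)
dw_loop /= len(batch)

dw_vec = batch_gradient(batch, target, w)
print(np.allclose(dw_vec, dw_loop))
```

With this helper, the body of the batch loop would shrink to w -= learn_rate * batch_gradient(batch, target, w).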

Now, one more thing: you always go through the training set from start to end. You should shuffle the data in each epoch. Always visiting the samples in the same order can hurt your performance, especially if, for example, all samples of class A come first and all samples of class B come after them. It can also make your training go in cycles. So go through the set in a random order, e.g. with

order = np.random.permutation(len(ytr))

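and replace every use of i as a sample index with order[i] in the epoch() function. In the mini-batch version, that means slicing each batch from Xtr[order] and ytr[order] rather than from Xtr and ytr directly.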
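
One possible way to combine the permutation with mini-batches is to cut the shuffled index array into chunks. This is a sketch under assumptions: shuffled_batches is a hypothetical helper, and it uses NumPy's Generator API instead of the np.random.permutation call above.

```python
import numpy as np

def shuffled_batches(n_samples, batch_size, rng):
    """Yield index arrays that cover 0..n_samples-1 once, in random order."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        # the last chunk is shorter if batch_size does not divide n_samples
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(shuffled_batches(10, 4, rng))
for idx in batches:
    print(idx)
```

Inside epoch(), each idx would then select one mini-batch via Xtr[idx] and ytr[idx]. Calling the helper again at the start of every epoch gives a fresh order each time.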

And a more general remark: Global variables are often considered bad design, as you don't have any control over which snippet modifies your variables. Rather pass w as a parameter. The same goes for the learning rate and the batch size.
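
As a rough illustration of that remark, here is a sketch of an epoch() that takes the data, w, the learning rate and the batch size as arguments and returns the updated weights instead of touching a global. The signature, the default values, the rng argument and the toy data are assumptions for this example, not part of the code above.

```python
import numpy as np

def epoch(X, y, w, learn_rate=0.1, batch_size=10, rng=None):
    """Run one shuffled mini-batch epoch; return (new_w, mean_error)."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(y))
    w = w.copy()                      # leave the caller's array untouched
    err_sum = 0.0
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        x_l1 = np.tanh(X[idx] @ w)
        err = x_l1 - y[idx]
        err_sum += err.sum()
        delta_l1 = 2 * err * (1 - x_l1**2)
        w -= learn_rate * (X[idx].T @ delta_l1) / len(idx)
    return w, err_sum / len(y)

# Toy data built the same way as in the question (seed is arbitrary)
rng = np.random.default_rng(0)
X = rng.random((100, 3))
X[:, 0] = 1
s = (X**2).sum(axis=1)
y = np.where(s > s.mean(), 1.0, -1.0)

w_init = rng.random(3)
w_new, mean_err = epoch(X, y, w_init, learn_rate=0.1, batch_size=10, rng=rng)
print("Mean error: %f" % mean_err)
```

The training loop then becomes w, mean_err = epoch(Xtr, ytr, w) repeated as often as needed, and predict() can take w as a second argument in the same way.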