I've been trying to use stochastic gradient descent with sum of squared error as the cost function to build a feed-forward neural network, trained with the backpropagation algorithm, that is able to represent this training data:
Input Output
{{0,1} , {1,0,0,0,0,0,0,0}}
{{0.1,1}, {0,1,0,0,0,0,0,0}}
{{0.2,1}, {0,0,1,0,0,0,0,0}}
{{0.3,1}, {0,0,0,1,0,0,0,0}}
{{0.4,1}, {0,0,0,0,1,0,0,0}}
{{0.5,1}, {0,0,0,0,0,1,0,0}}
{{0.6,1}, {0,0,0,0,0,0,1,0}}
{{0.7,1}, {0,0,0,0,0,0,0,1}}
output = 1/(1 + e^(-weightedSum))
errorGradient = (output - trainingData) * output * (1 - output) * inputUnit;
weights[i] = weights[i] - (learningRate * errorGradient)
where trainingData is the target output for the unit and inputUnit is the input attached to the weight.
package ann;

import java.util.Arrays;
import java.util.Random;

public class MSEANN {

    static double learningRate = 0.1;
    static double totalError = 0;
    static double previousTotalError = Double.POSITIVE_INFINITY;
    static double[] weights;

    public static void main(String[] args) {
        genRanWeights();

        double[][][] trainingData = {
            {{0,   1}, {1,0,0,0,0,0,0,0}},
            {{0.1, 1}, {0,1,0,0,0,0,0,0}},
            {{0.2, 1}, {0,0,1,0,0,0,0,0}},
            {{0.3, 1}, {0,0,0,1,0,0,0,0}},
            {{0.4, 1}, {0,0,0,0,1,0,0,0}},
            {{0.5, 1}, {0,0,0,0,0,1,0,0}},
            {{0.6, 1}, {0,0,0,0,0,0,1,0}},
            {{0.7, 1}, {0,0,0,0,0,0,0,1}},
        };

        while (true) {
            int errorCount = 0;
            totalError = 0;
            // Iterate through the training set
            for (int i = 0; i < trainingData.length; i++) {
                // Iterate through the list of output units
                for (int out = 0; out < trainingData[i][1].length; out++) {
                    double weightedSum = 0;
                    // Weighted sum for this training sample and this output unit
                    for (int ii = 0; ii < trainingData[i][0].length; ii++) {
                        weightedSum += trainingData[i][0][ii] * weights[out * 2 + ii];
                    }
                    // Sigmoid output
                    double output = 1 / (1 + Math.exp(-weightedSum));
                    double error = Math.pow(trainingData[i][1][out] - output, 2) / 2;
                    totalError += error;
                    if (error >= 0.001) {
                        errorCount++;
                    }
                    // Update the two weights feeding this output unit
                    for (int iii = out * 2; iii < (out + 1) * 2; iii++) {
                        double firstGrad = (trainingData[i][1][out] - output) * output * (1 - output);
                        weights[iii] += learningRate * firstGrad * trainingData[i][0][iii % 2];
                    }
                }
            }
            // Total error accumulated over this epoch
            System.out.println(totalError);
            // If the error is getting worse every iteration, terminate the program.
            if (totalError - previousTotalError >= 0) {
                System.out.println("FAIL TO CONVERGE");
                System.exit(0);
            }
            previousTotalError = totalError;
            if (errorCount == 0) {
                System.out.println("Final weights: " + Arrays.toString(weights));
                System.exit(0);
            }
        }
    }

    // Generate random weights in [-1/sqrt(2), 1/sqrt(2)]
    static void genRanWeights() {
        Random r = new Random();
        double low = -1 / Math.sqrt(2);
        double high = 1 / Math.sqrt(2);
        double[] result = new double[16];
        for (int i = 0; i < result.length; i++) {
            result[i] = low + (high - low) * r.nextDouble();
        }
        System.out.println(Arrays.toString(result));
        weights = result;
    }
}
Input Output
{{0 ,1}, {1,0,0,0,0,0,0,0}},
{{0.1,1}, {0,1,0,0,0,0,0,0}},
// {{0.2,1}, {0,0,1,0,0,0,0,0}},
Your implementation seems to be correct, and the problem isn't related to the choice of learning rate.
The problem comes from the limitations of the Single-Layer Perceptron (without hidden layers), which can't solve problems that are not linearly separable, like the XOR binary operation, unless we use a special activation function that makes it work for XOR; but I don't know whether a special activation function can make it work for your problem. To solve your problem, you will probably have to choose another neural network layout, like the Multi-Layer Perceptron.
The problem you give to the Single-Layer Perceptron is not linearly separable on a 2-dimensional surface. When the input takes only 2 different values, it is possible to separate the outputs with one line. But with 3 or more different values for the inputs, and the outputs you want, some outputs need two lines to be separated from the other values.
For example, the 2D graph for the second output neuron of your network, with 3 possible values for the input, like in your test:

  ^
  |
  |       line 1    line 2
  |          \        /
  |           \      /
 0.0    0      \ 1  /      0
  |             \  /
  |
  +---------------------------> input values
       0.0     0.1     0.2
To separate the 1 from the two 0s, it needs two lines instead of one. So the second neuron won't be able to produce the desired output.
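This can also be checked numerically. The sketch below (not part of your program; the class name and the brute-force grid are mine) scans a grid of weights and biases for a single threshold unit step(w*x + b) and finds no setting that fires only for the input 0.1:

```java
public class SeparabilityCheck {
    // true if some threshold unit step(w*x + b) reproduces the targets exactly
    static boolean isSeparable(double[] xs, int[] targets) {
        for (double w = -100; w <= 100; w += 0.5) {
            for (double b = -100; b <= 100; b += 0.5) {
                boolean allMatch = true;
                for (int i = 0; i < xs.length && allMatch; i++) {
                    allMatch = ((w * xs[i] + b > 0) ? 1 : 0) == targets[i];
                }
                if (allMatch) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // targets of the second output neuron for the inputs 0.0, 0.1, 0.2
        System.out.println("separable: "
            + isSeparable(new double[]{0.0, 0.1, 0.2}, new int[]{0, 1, 0}));
        // prints: separable: false
    }
}
```

The grid is coarse, but the non-separability is exact: firing at 0.1 but not at 0.0 forces w > 0, while firing at 0.1 but not at 0.2 forces w < 0, a contradiction, so no resolution would find a solution.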
As the bias always has the same value, it doesn't influence the problem and is not shown on the graph.
If you change the target outputs to have a linearly separable problem, then the Single-Layer Perceptron will work:
{{0.0, 1}, {1,0,0,0,0,0,0,0}},
{{0.1, 1}, {1,1,0,0,0,0,0,0}},
{{0.2, 1}, {1,1,1,0,0,0,0,0}},
{{0.3, 1}, {1,1,1,1,0,0,0,0}},
{{0.4, 1}, {1,1,1,1,1,0,0,0}},
{{0.5, 1}, {1,1,1,1,1,1,0,0}},
{{0.6, 1}, {1,1,1,1,1,1,1,0}},
{{0.7, 1}, {1,1,1,1,1,1,1,1}},
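Each column of these new targets is a simple threshold on the input: neuron k fires exactly when the input is at least 0.1*k, so one line per neuron is enough. A quick grid search (a sketch with made-up names, reusing the same brute-force idea as above) confirms that every column is separable on its own:

```java
public class ThermometerCheck {
    // true if some threshold unit step(w*x + b) reproduces the targets exactly
    static boolean columnSeparable(double[] xs, int[] targets) {
        for (double w = -100; w <= 100; w += 0.5) {
            for (double b = -100; b <= 100; b += 0.5) {
                boolean ok = true;
                for (int i = 0; i < xs.length && ok; i++) {
                    ok = ((w * xs[i] + b > 0) ? 1 : 0) == targets[i];
                }
                if (ok) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        double[] xs = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7};
        for (int k = 0; k < 8; k++) {
            int[] col = new int[8];
            for (int i = 0; i < 8; i++) {
                col[i] = (i >= k) ? 1 : 0;  // column k of the cumulative targets above
            }
            System.out.println("neuron " + k + " separable: " + columnSeparable(xs, col));
        }
    }
}
```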
In some cases, it is possible to introduce arbitrary inputs computed from the true inputs. For example, with 4 possible values for the true inputs:
{{-1.0, 0.0, 1}, {1,0,0,0,0,0,0,0}},
{{-1.0, 0.1, 1}, {0,1,0,0,0,0,0,0}},
{{ 1.0, 0.2, 1}, {0,0,1,0,0,0,0,0}},
{{ 1.0, 0.3, 1}, {0,0,0,1,0,0,0,0}},
If, for each output neuron, you draw the graph with the true inputs on the X axis and the arbitrary inputs on the Y axis, you will see, for the 4 points representing the outputs, that the 1 can be separated from the 0s by only one line.
To handle 8 possible values for the true inputs, you can add a second arbitrary input, and get a 3D graph. Another way to handle 8 possible values without a second arbitrary input is to put the points on a circle. For example:
double[][][] trainingData = {
    {{0.0, 0.0, 1}, {1,0,0,0,0,0,0,0}},
    {{0.0, 0.1, 1}, {0,1,0,0,0,0,0,0}},
    {{0.0, 0.2, 1}, {0,0,1,0,0,0,0,0}},
    {{0.0, 0.3, 1}, {0,0,0,1,0,0,0,0}},
    {{0.0, 0.4, 1}, {0,0,0,0,1,0,0,0}},
    {{0.0, 0.5, 1}, {0,0,0,0,0,1,0,0}},
    {{0.0, 0.6, 1}, {0,0,0,0,0,0,1,0}},
    {{0.0, 0.7, 1}, {0,0,0,0,0,0,0,1}},
};
for (int i = 0; i < 8; i++) {
    // multiply the true inputs by 8 before the sin/cos in order
    // to increase the distance between points, and multiply the
    // resulting sin/cos by 2 for the same reason
    trainingData[i][0][0] = 2.0 * Math.cos(trainingData[i][0][1] * 8.0);
    trainingData[i][0][1] = 2.0 * Math.sin(trainingData[i][0][1] * 8.0);
}
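To see what this embedding does, here is a small helper (hypothetical, with names of my choosing; it is not part of the answer's test code) that computes the 8 points and their smallest pairwise distance. The points end up spread around a circle of radius 2, roughly 1.34 apart at their closest:

```java
public class CirclePoints {
    // the circle embedding used above: x -> (2*cos(8x), 2*sin(8x))
    static double[] embed(double x) {
        return new double[]{2.0 * Math.cos(x * 8.0), 2.0 * Math.sin(x * 8.0)};
    }

    // smallest Euclidean distance between any two embedded inputs
    static double minPairwiseDistance(double[] inputs) {
        double min = Double.POSITIVE_INFINITY;
        for (int i = 0; i < inputs.length; i++) {
            for (int j = i + 1; j < inputs.length; j++) {
                double[] a = embed(inputs[i]);
                double[] b = embed(inputs[j]);
                min = Math.min(min, Math.hypot(a[0] - b[0], a[1] - b[1]));
            }
        }
        return min;
    }

    public static void main(String[] args) {
        double[] inputs = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7};
        System.out.printf("min distance between points: %.3f%n",
                minPairwiseDistance(inputs));
        // prints: min distance between points: 1.340
    }
}
```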
If you don't want to, or can't, add arbitrary inputs or modify the target outputs, you will have to choose another neural network layout, like the Multi-Layer Perceptron. But maybe a special activation function can solve your problem with a Single-Layer Perceptron. I tried with a Gaussian, but it didn't work, maybe due to wrong parameters.
And in the future, what steps do you recommend to debug effectively for this type of problem?
Think about the limitations of the layout you have chosen and try other layouts. If you choose a Multi-Layer Perceptron, think about changing the number of hidden layers, and the number of neurons in these layers.
It is sometimes possible to normalize the inputs and the outputs of the network; in some cases this greatly improves performance, as in the tests I have done with your training data. But I think there may be cases where it is better to train the network with the true inputs, whatever the time needed to train it.
I have tested your training data with a Multi-Layer Perceptron that has one hidden layer of 15 neurons and no sigmoid function on the output neurons. My network converges and stops at the required error after around 100 000 training cycles with a learning rate of 0.1.
If I modify the inputs in the following way:
0   -> 0
0.1 -> 1
0.2 -> 2
0.3 -> 3
0.4 -> 4
0.5 -> 5
0.6 -> 6
0.7 -> 7
Then my network converges a lot more quickly, and even more quickly if I convert the values to the range [-7, 7]:
0   -> -7
0.1 -> -5
0.2 -> -3
0.3 -> -1
0.4 ->  1
0.5 ->  3
0.6 ->  5
0.7 ->  7
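This second mapping is just the affine function x -> 20x - 7, which sends the original range [0, 0.7] onto [-7, 7]. A minimal sketch (the class and method names are mine, not from the tested program):

```java
public class InputRescale {
    // affine map sending the original input range [0, 0.7] onto [-7, 7]
    static double rescale(double x) {
        return 20.0 * x - 7.0;
    }

    public static void main(String[] args) {
        // print the full mapping, 0.0 -> -7 up to 0.7 -> 7
        for (double x = 0.0; x <= 0.7001; x += 0.1) {
            System.out.printf("%.1f -> %.0f%n", x, rescale(x));
        }
    }
}
```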
It is a little faster if I modify the target outputs, replacing the 0s by -1:
{{-7,1}, { 1,-1,-1,-1,-1,-1,-1,-1}},
{{-5,1}, {-1, 1,-1,-1,-1,-1,-1,-1}},
{{-3,1}, {-1,-1, 1,-1,-1,-1,-1,-1}},
{{-1,1}, {-1,-1,-1, 1,-1,-1,-1,-1}},
{{ 1,1}, {-1,-1,-1,-1, 1,-1,-1,-1}},
{{ 3,1}, {-1,-1,-1,-1,-1, 1,-1,-1}},
{{ 5,1}, {-1,-1,-1,-1,-1,-1, 1,-1}},
{{ 7,1}, {-1,-1,-1,-1,-1,-1,-1, 1}},
With this normalization of inputs and outputs, I get the required error after around 2000 training cycles, against 100 000 without normalization.
Another example is your implementation with the first 2 lines of the training data, as in your question:
Input Output
{{0 ,1}, {1,0,0,0,0,0,0,0}},
{{0.1,1}, {0,1,0,0,0,0,0,0}},
// {{0.2,1}, {0,0,1,0,0,0,0,0}},
It takes around 600 000 training cycles to get the required error. But if I use these training data:
Input Output
{{0 ,1}, {1,0,0,0,0,0,0,0}},
{{1 ,1}, {0,1,0,0,0,0,0,0}},
with 1 instead of the input 0.1, it takes only 9000 training cycles. And moreover, if I use 10 instead of 0.1 and -10 instead of 0, it takes only 1500 training cycles.
But, unlike for my Multi-Layer Perceptron, replacing the 0s in the target outputs by -1 breaks the performance.