
[ANN] Backprop issues

Started by QmQ, February 25, 2008 06:52 PM
2 comments, last by alvaro 16 years, 11 months ago
Hello to everybody :) I've been learning NNs for some time now and I've developed an NN class which I have successfully trained on (as an example) the OR/AND logic functions (using a single neuron). The training was based on this simple 'learning rule':

 new weight = old weight + delta * input
 new bias = old bias - delta

where: delta = target value - output value.
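
In code, that rule looks roughly like this (a simplified sketch with made-up names, not my actual class):

 // A simplified sketch of the single-neuron rule above (illustrative names only).
 // The bias update has the opposite sign, which is consistent with treating
 // the bias as a weight on a constant input of -1.
 struct Neuron
 {
     double weights[2];
     double bias;
 };

 void trainStep(Neuron& n, const double input[2], double target, double output)
 {
     double delta = target - output;        // delta = target value - output value
     for (int i = 0; i < 2; ++i)
         n.weights[i] += delta * input[i];  // new weight = old weight + delta * input
     n.bias -= delta;                       // new bias = old bias - delta
 }
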
Everything worked fine, no problems there. However, if I got something wrong, let me know :)

Later I decided to advance to a multilayer NN, and I wanted to start with the immortal XOR function. It turned out I can't train it... I have a network of 3 layers: 2 hidden and 1 output. Both hidden layers have two neurons and the output layer has one (so 5 neurons in total). All neurons look the same: 2 inputs + bias (with a constant 'input' of -1) and 1 output. I use the sigmoid as the activation function.

To train it I needed backprop, so I started reading. I have read a lot, and not a single one of the articles I read was clear enough. So I was lured here ;) (after reading a few posts from this forum).

So, the questions:

#1 How to change the weights? Currently I'm using this equation for all neurons (the same for _all_ layers!):

foreach weight:
 weight = weight + learnRate * input * output * (1-output) * delta

where:
 learnRate -> constant, 0.3 for now so I get good (yet slow) results
 input -> the input 'coming in' through this weight
 output -> the output from this entire neuron
 output * (1-output) -> the derivative of the sigmoid function
 delta -> the error value calculated like this:
  for output layer: delta = target - output
  for each previous layer, delta is backpropagated: it gets multiplied by the weight of the connection it travels through,
  and if it arrives over several connections, the contributions are summed.
This is based on (and nicely illustrated at) this site. What's wrong with this algorithm?

----------------------

#2 How to change the bias? At the moment I just do:

bias = bias - learnRate * delta
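
Putting #1 and #2 together, my current per-neuron update looks roughly like this (again a simplified sketch with made-up names, not my actual class):

 // Simplified sketch of my current update for one neuron, given the delta
 // that has already been backpropagated to it (illustrative names only).
 #include <cstddef>
 #include <vector>

 void updateNeuron(std::vector<double>& weights,        // one weight per incoming connection
                   double& bias,                        // weight on the constant -1 input
                   const std::vector<double>& inputs,   // values 'coming in' through the weights
                   double output,                       // this neuron's sigmoid output
                   double delta,                        // backproped error; target - output at the output layer
                   double learnRate)
 {
     double dSigmoid = output * (1.0 - output);         // derivative of the sigmoid
     for (std::size_t i = 0; i < weights.size(); ++i)
         weights[i] += learnRate * inputs[i] * dSigmoid * delta;   // #1
     bias -= learnRate * delta;                                    // #2
 }
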
----------------------

#3 How do I implement momentum? None of the articles I've read clarified what to multiply by the infamous alpha parameter when updating weight number i. Should I use weight(i-1) (the previously updated one)? If so, how do I calculate momentum for i=0?

----------------------

To summarize: the network sometimes (VERY seldom) trains OK and does, after all, work for XOR. When it does, the output is very good (the value for zero is really low and the value for one is almost 1). But most of the time it can't manage it, even after 100,000 iterations of backprop... The funny thing is that when I changed XOR to OR to see if it manages (still using the multilayer NN), it almost did: for the OR training set it almost always found the right solution and very seldom failed.

So, as you can see, something's wrong and I can't find it... I will be very grateful for all your comments :)

Regards, QmQ

UPDATE: I'm also including this dump to make everything clear. It's a bit long, but it's for demo purposes only.

	BEFORE TRAINING	|  AFTER
0 XOR 0 = 0 (0.389031)	 -> 1 (0.540423)
0 XOR 1 = 0 (0.371058)	 -> 1 (0.507694)
1 XOR 0 = 0 (0.350031)	 -> 1 (0.920854)
1 XOR 1 = 0 (0.333749)	 -> 0 (0.044733)

0 XOR 0 = 0 (0.476952)	 -> 1 (0.590870)
0 XOR 1 = 0 (0.482337)	 -> 1 (0.598985)
1 XOR 0 = 0 (0.426534)	 -> 1 (0.584926)
1 XOR 1 = 0 (0.445482)	 -> 0 (0.244864)

0 XOR 0 = 0 (0.178641)	 -> 0 (0.013935)
0 XOR 1 = 0 (0.192264)	 -> 0 (0.494459)
1 XOR 0 = 0 (0.183975)	 -> 1 (0.983203)
1 XOR 1 = 0 (0.199005)	 -> 0 (0.494931)

0 XOR 0 = 1 (0.519888)	 -> 0 (0.019064)
0 XOR 1 = 0 (0.453202)	 -> 1 (0.980676)
1 XOR 0 = 0 (0.450809)	 -> 1 (0.980603)
1 XOR 1 = 0 (0.399967)	 -> 0 (0.019860)

0 XOR 0 = 0 (0.185017)	 -> 0 (0.226056)
0 XOR 1 = 0 (0.204674)	 -> 1 (0.586697)
1 XOR 0 = 0 (0.183746)	 -> 1 (0.586518)
1 XOR 1 = 0 (0.202097)	 -> 1 (0.587045)

0 XOR 0 = 1 (0.597928)	 -> 0 (0.022980)
0 XOR 1 = 1 (0.627749)	 -> 0 (0.490592)
1 XOR 0 = 1 (0.644793)	 -> 1 (0.982135)
1 XOR 1 = 1 (0.671332)	 -> 0 (0.491087)

0 XOR 0 = 1 (0.647987)	 -> 1 (0.588371)
0 XOR 1 = 1 (0.620041)	 -> 1 (0.866711)
1 XOR 0 = 1 (0.624817)	 -> 1 (0.555343)
1 XOR 1 = 1 (0.594609)	 -> 0 (0.008398)

0 XOR 0 = 1 (0.522973)	 -> 0 (0.019253)
0 XOR 1 = 1 (0.531933)	 -> 1 (0.980488)
1 XOR 0 = 1 (0.529388)	 -> 1 (0.980402)
1 XOR 1 = 1 (0.540660)	 -> 0 (0.020064)

0 XOR 0 = 0 (0.297622)	 -> 1 (0.628855)
0 XOR 1 = 0 (0.313843)	 -> 1 (0.578957)
1 XOR 0 = 0 (0.281573)	 -> 1 (0.566705)
1 XOR 1 = 0 (0.301616)	 -> 0 (0.246367)

0 XOR 0 = 1 (0.578521)	 -> 1 (0.530156)
0 XOR 1 = 1 (0.579873)	 -> 1 (0.870113)
1 XOR 0 = 1 (0.566371)	 -> 1 (0.616105)
1 XOR 1 = 1 (0.562629)	 -> 0 (0.000170)

0 XOR 0 = 0 (0.468334)	 -> 1 (0.531848)
0 XOR 1 = 1 (0.509844)	 -> 1 (0.679352)
1 XOR 0 = 0 (0.442307)	 -> 1 (0.669115)
1 XOR 1 = 0 (0.481913)	 -> 0 (0.145127)

0 XOR 0 = 1 (0.691101)	 -> 1 (0.589911)
0 XOR 1 = 1 (0.712194)	 -> 1 (0.850123)
1 XOR 0 = 1 (0.666724)	 -> 1 (0.513916)
1 XOR 1 = 1 (0.685775)	 -> 0 (0.064833)

0 XOR 0 = 1 (0.546131)	 -> 0 (0.229658)
0 XOR 1 = 1 (0.567433)	 -> 1 (0.894752)
1 XOR 0 = 1 (0.517446)	 -> 0 (0.434778)
1 XOR 1 = 1 (0.535919)	 -> 0 (0.435251)

0 XOR 0 = 1 (0.743372)	 -> 0 (0.291190)
0 XOR 1 = 1 (0.738016)	 -> 1 (0.896570)
1 XOR 0 = 1 (0.725057)	 -> 0 (0.404596)
1 XOR 1 = 1 (0.724783)	 -> 0 (0.405093)

0 XOR 0 = 0 (0.434931)	 -> 0 (0.158733)
0 XOR 1 = 0 (0.420743)	 -> 1 (0.916917)
1 XOR 0 = 0 (0.419672)	 -> 0 (0.457688)
1 XOR 1 = 0 (0.407032)	 -> 0 (0.458170)

0 XOR 0 = 1 (0.718538)	 -> 1 (0.584902)
0 XOR 1 = 1 (0.748483)	 -> 1 (0.869437)
1 XOR 0 = 1 (0.693568)	 -> 1 (0.559624)
1 XOR 1 = 1 (0.715729)	 -> 0 (0.004674)

0 XOR 0 = 0 (0.298970)	 -> 0 (0.020819)
0 XOR 1 = 0 (0.299558)	 -> 0 (0.491469)
1 XOR 0 = 0 (0.295715)	 -> 1 (0.982536)
1 XOR 1 = 0 (0.296965)	 -> 0 (0.491933)

0 XOR 0 = 0 (0.463896)	 -> 1 (0.586986)
0 XOR 1 = 0 (0.479988)	 -> 1 (0.851331)
1 XOR 0 = 0 (0.484199)	 -> 1 (0.504883)
1 XOR 1 = 0 (0.499989)	 -> 0 (0.075276)

0 XOR 0 = 1 (0.799050)	 -> 0 (0.017166)
0 XOR 1 = 1 (0.811284)	 -> 1 (0.984691)
1 XOR 0 = 1 (0.804772)	 -> 0 (0.489342)
1 XOR 1 = 1 (0.815360)	 -> 0 (0.493099)

0 XOR 0 = 1 (0.747252)	 -> 0 (0.019117)
0 XOR 1 = 1 (0.735567)	 -> 1 (0.980641)
1 XOR 0 = 1 (0.742461)	 -> 1 (0.980535)
1 XOR 1 = 1 (0.733916)	 -> 0 (0.019912)

0 XOR 0 = 0 (0.384284)	 -> 1 (0.545196)
0 XOR 1 = 0 (0.386848)	 -> 1 (0.524427)
1 XOR 0 = 0 (0.354438)	 -> 1 (0.922754)
1 XOR 1 = 0 (0.355706)	 -> 0 (0.021771)

0 XOR 0 = 1 (0.670121)	 -> 1 (0.577314)
0 XOR 1 = 1 (0.667188)	 -> 1 (0.870601)
1 XOR 0 = 1 (0.635874)	 -> 1 (0.510985)
1 XOR 1 = 1 (0.632673)	 -> 0 (0.059036)

0 XOR 0 = 0 (0.465121)	 -> 0 (0.019321)
0 XOR 1 = 1 (0.536488)	 -> 1 (0.981718)
1 XOR 0 = 1 (0.506689)	 -> 0 (0.489771)
1 XOR 1 = 1 (0.564385)	 -> 0 (0.493413)

0 XOR 0 = 0 (0.461397)	 -> 1 (0.575232)
0 XOR 1 = 0 (0.449817)	 -> 1 (0.612399)
1 XOR 0 = 0 (0.465259)	 -> 1 (0.597774)
1 XOR 1 = 0 (0.456207)	 -> 0 (0.234763)

0 XOR 0 = 0 (0.453744)	 -> 0 (0.023842)
0 XOR 1 = 0 (0.474004)	 -> 1 (0.978284)
1 XOR 0 = 0 (0.478326)	 -> 0 (0.492413)
1 XOR 1 = 0 (0.497930)	 -> 0 (0.492878)

0 XOR 0 = 1 (0.635919)	 -> 0 (0.019710)
0 XOR 1 = 1 (0.626395)	 -> 1 (0.980007)
1 XOR 0 = 1 (0.622861)	 -> 1 (0.980013)
1 XOR 1 = 1 (0.615217)	 -> 0 (0.020488)

0 XOR 0 = 0 (0.368990)	 -> 1 (0.540470)
0 XOR 1 = 0 (0.359281)	 -> 1 (0.502355)
1 XOR 0 = 0 (0.373723)	 -> 1 (0.917480)
1 XOR 1 = 0 (0.362805)	 -> 0 (0.053266)

0 XOR 0 = 1 (0.538190)	 -> 1 (0.587219)
0 XOR 1 = 0 (0.485423)	 -> 1 (0.867185)
1 XOR 0 = 1 (0.524154)	 -> 1 (0.538894)
1 XOR 1 = 0 (0.477507)	 -> 0 (0.025398)

0 XOR 0 = 0 (0.317085)	 -> 0 (0.300653)
0 XOR 1 = 0 (0.314979)	 -> 1 (0.896030)
1 XOR 0 = 0 (0.313905)	 -> 0 (0.400344)
1 XOR 1 = 0 (0.311818)	 -> 0 (0.400875)

0 XOR 0 = 1 (0.651883)	 -> 1 (0.569301)
0 XOR 1 = 1 (0.667792)	 -> 1 (0.613104)
1 XOR 0 = 1 (0.665322)	 -> 1 (0.598036)
1 XOR 1 = 1 (0.680291)	 -> 0 (0.238971)
#1
Your objective is to minimize E=delta^2, so you compute the partial derivative of that quantity with respect to each weight. This turns out to be not too hard to compute. You should probably give it a try yourself. The formulas for the weights of the output neurons are easy to compute, and the ones for previous layers are not terribly difficult either. Once you have those partial derivatives, you update all the weights using:
weight_i = weight_i - epsilon * partial_derivative_i

epsilon is a small constant that controls how fast your ANN learns.
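
In code, the update step itself is just this (a sketch with illustrative names; the partial derivatives are whatever your backprop computes):

 // Sketch of the plain gradient-descent step: every weight moves a small
 // step against its partial derivative of E.
 #include <cstddef>
 #include <vector>

 void gradientStep(std::vector<double>& weights,
                   const std::vector<double>& partials,   // dE/dweight_i from backprop
                   double epsilon)                         // small learning-rate constant
 {
     for (std::size_t i = 0; i < weights.size(); ++i)
         weights[i] -= epsilon * partials[i];   // weight_i = weight_i - epsilon * partial_derivative_i
 }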

#2
Treat the bias the same way you treat other weights (see #1).

#3
Remember how much you changed each weight on the last iteration, and use the formula
new_update_i = - epsilon * partial_derivative_i + lambda * old_update_i
weight_i = weight_i + new_update_i

lambda is a value slightly under 1 that controls how much memory your updates have.
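
As a sketch (illustrative names only), the loop above becomes:

 // Sketch of the momentum rule: keep the previous update for each weight
 // (initialized to zero before the first iteration) and mix it into the new one.
 #include <cstddef>
 #include <vector>

 void gradientStepWithMomentum(std::vector<double>& weights,
                               std::vector<double>& oldUpdates,      // starts as all zeros
                               const std::vector<double>& partials,  // dE/dweight_i from backprop
                               double epsilon, double lambda)
 {
     for (std::size_t i = 0; i < weights.size(); ++i)
     {
         double newUpdate = -epsilon * partials[i] + lambda * oldUpdates[i];
         weights[i]   += newUpdate;
         oldUpdates[i] = newUpdate;   // remembered for the next iteration
     }
 }

Note that the "old update" is per weight and comes from the previous training iteration (not from weight i-1); before the first iteration there is no old update yet, so it starts at zero.
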
Thanks for such a speedy reply!

However, I don't fully understand part of the first answer.

#1
Quote:
Your objective is to minimize E=delta^2...

OK. But did I get delta OK? Am I calculating it correctly?

Quote:
...so you compute the partial derivative of that quantity with respect to each weight. This turns out to be not too hard to compute. You should probably give it a try yourself. The formulas for the weights of the output neurons are easy to compute, and the ones for previous layers are not terribly difficult either.

I'm sorry, but I don't quite get this. Could you maybe explain a bit more? Or (if that's not too much trouble) show some equations/examples? I have no trouble with derivatives as long as I know what to derive, and at the moment I don't...

Quote:
Once you have those partial derivatives, you update all the weights using:
weight_i = weight_i - epsilon * partial_derivative_i

This is clear. Why '-' though? Or maybe it's not important?

#2
Clear, thanks.

#3
Clear, thanks.


UPDATE
I have implemented both #2 and #3. It still doesn't work, but now the results are very different from each other. Examples:
0 XOR 0 = 1 (0.752442)   -> 0 (0.000063)
0 XOR 1 = 1 (0.752411)   -> 0 (0.000063)
1 XOR 0 = 1 (0.750067)   -> 0 (0.000063)
1 XOR 1 = 1 (0.750056)   -> 0 (0.000063)

0 XOR 0 = 1 (0.671167)   -> 1 (0.999945)
0 XOR 1 = 1 (0.677606)   -> 1 (0.999945)
1 XOR 0 = 1 (0.669963)   -> 1 (0.999945)
1 XOR 1 = 1 (0.675392)   -> 1 (0.999945)

0 XOR 0 = 0 (0.455416)   -> 0 (0.000083)
0 XOR 1 = 0 (0.451929)   -> 0 (0.000083)
1 XOR 0 = 0 (0.457289)   -> 0 (0.000083)
1 XOR 1 = 0 (0.454057)   -> 0 (0.000083)

0 XOR 0 = 0 (0.404830)   -> 0 (0.000083)
0 XOR 1 = 0 (0.396484)   -> 0 (0.000083)
1 XOR 0 = 0 (0.407394)   -> 0 (0.000083)
1 XOR 1 = 0 (0.399356)   -> 0 (0.000083)

0 XOR 0 = 0 (0.249845)   -> 0 (0.000055)
0 XOR 1 = 0 (0.247747)   -> 0 (0.000055)
1 XOR 0 = 0 (0.250645)   -> 0 (0.000055)
1 XOR 1 = 0 (0.248869)   -> 0 (0.000055)



What parameters should work? Currently I use:
3000 iterations
epsilon = 0.1
lambda = 0.9


[Edited by - QmQ on February 26, 2008 11:36:31 AM]
Quote:
Original post by QmQ
Your objective is to minimize E=delta^2...
OK. But did I get delta OK? Am I calculating it correctly?

Yes, delta is the difference between your output and the desired value.

Quote:
...so you compute the partial derivative of that quantity with respect to each weight. This turns out to be not too hard to compute. You should probably give it a try yourself. The formulas for the weights of the output neurons are easy to compute, and the ones for previous layers are not terribly difficult either.

I'm sorry, but I don't quite get this. Could you maybe explain a bit more? Or (if that's not too much trouble) show some equations/examples? I have no trouble with derivatives as long as I know what to derive, and at the moment I don't...

Well, it's what I said before. You need to find the derivative of E with respect to each weight. Give it a little more thought, and I'll tell you the answer if you still can't figure it out. Some of those texts you've been reading probably contain this as well.

Quote:
Once you have those partial derivatives, you update all the weights using:
weight_i = weight_i - epsilon * partial_derivative_i

This is clear. Why '-' though? Or maybe it's not important?

It's `-' because you are trying to find the minimum of E (see Wikipedia on gradient descent).

Quote:
What parameters should work? Currently I use:
3000 iterations
epsilon = 0.1
lambda = 0.9

I would start with lambda=0.0, since the network should work without any momentum. The right value for epsilon depends on your data and the configuration of your ANN, in ways that are too hard to predict. Just try a few values in a large range (say, from 0.0001 to 1.0 in powers of 10). If nothing works, your formulas are probably wrong.


NOTE: I forgot to mention it earlier, but the function that you really want to minimize is the sum of the square of delta over your entire data set. In each step you use the square of delta in the current training case as an approximation to the function you want to minimize. Momentum basically mixes in the gradient from previous cases, which makes the approximation a little bit better. You could also just compute the gradient over the entire sample set and only change the coefficients once you have read them all in, but for some reason this is not what people usually do with multilayer perceptrons.
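
For a single sigmoid neuron, the whole-data-set variant would look something like this sketch (illustrative names only; the per-case "online" version simply applies the update inside the pattern loop instead):

 // Sketch: one batch step for a single sigmoid neuron, summing the gradient of
 // delta^2 over the whole training set before changing the weights once.
 // The factor of 2 from differentiating delta^2 is folded into epsilon.
 #include <cmath>
 #include <cstddef>
 #include <vector>

 struct Pattern { std::vector<double> inputs; double target; };

 void batchStep(std::vector<double>& weights, const std::vector<Pattern>& data, double epsilon)
 {
     std::vector<double> grad(weights.size(), 0.0);
     for (const Pattern& p : data)
     {
         double net = 0.0;
         for (std::size_t i = 0; i < weights.size(); ++i)
             net += weights[i] * p.inputs[i];
         double out   = 1.0 / (1.0 + std::exp(-net));            // sigmoid
         double delta = out - p.target;
         for (std::size_t i = 0; i < weights.size(); ++i)
             grad[i] += delta * out * (1.0 - out) * p.inputs[i]; // sum dE/dw_i over all cases
     }
     for (std::size_t i = 0; i < weights.size(); ++i)
         weights[i] -= epsilon * grad[i];                        // one update per pass over the data
 }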

Wikipedia has an article on this subject, but it's not particularly well written.
