
Backpropagation in a two-layer feed-forward network

Started by Deliverance, March 22, 2010 03:36 PM
7 comments, last by Predictor 14 years, 8 months ago
I've been playing with neural networks these days and found quite an interesting thing (for me) about them. I'm trying to solve the XOR problem using a two-layer network. I found that in the initial phase every weight must be initialized with some random value. These random values seem to be very important, or so I found. There are combinations of random values that will cause the neural network not to converge to a solution, and I wonder why that is. How can I initialize the random values so that the neural network will always converge? Take as an example the code here: in the file bpnet.h, change line 31 from this

srand((unsigned)(time(NULL)));

to this

srand(1269290629);

Now, when you run the sample you'll see that the neural network does not successfully converge to an approximation of the XOR function. Why is that?
Let me start by saying that I am not an expert in ANNs. But I do know something about optimization and numerical computation in general.

I don't think you are going to find much in the way of hard guarantees of convergence. The potential function that you are trying to minimize might have local minima where your optimization gets stuck.

I haven't looked at your code, but a common mistake people make is initializing the weights to values that are too large. When you add up N random numbers of size S, the result will typically have a size of sqrt(N)*S. This can pretty quickly saturate your sigmoid, and then it might be hard to get out of there. So try initializing the weights to something small (perhaps on the order of 0.5/sqrt(number_of_inputs_to_this_neuron)). That way your network will initially learn mostly linear effects, and as the weights get larger it will move into the non-linear effects (saturation).
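
Something along these lines (a minimal sketch; uniformSigned() and the weight layout are my own placeholders, not part of the bpnet code):

#include <cmath>
#include <cstdlib>
#include <vector>

// Uniform random number in [-1, 1].
double uniformSigned()
{
    return 2.0 * std::rand() / RAND_MAX - 1.0;
}

// Scale the initial weights by 0.5/sqrt(fan_in) so the initial weighted
// sums stay in the near-linear region of the sigmoid instead of saturating.
void initWeights(std::vector<double>& weights, int numInputs)
{
    const double scale = 0.5 / std::sqrt(static_cast<double>(numInputs));
    weights.assign(numInputs, 0.0);
    for (int i = 0; i < numInputs; ++i)
        weights[i] = scale * uniformSigned();
}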

Quote: Original post by Deliverance
I've been playing with neural networks these days and found quite an interesting thing (for me) about them. [...] How can I initialize the random values so that the neural network will always converge? [...] Now, when you run the sample you'll see that the neural network does not successfully converge to an approximation of the XOR function. Why is that?


I don't know for sure, because it could be any of several different causes, but I will guess that your model is falling into local optima: areas which are better than their immediate surroundings, but not actually the best possible solutions. If this is indeed the problem, there are several possible remedies:
1. re-run multiple times with a different random initialization each time (see the sketch after this list)
2. initialize more intelligently (so that the initial model breaks the data well)
3. use an optimizer which is less prone to becoming trapped in local optima (hybrid global/local methods may work well for this)
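
For remedy 1, the loop is roughly this (a sketch only; initNetwork() and trainEpoch() are hypothetical stand-ins for whatever your bpnet code actually exposes):

#include <cstdlib>
#include <ctime>

void initNetwork();   // hypothetical: rebuild the net with fresh random weights
double trainEpoch();  // hypothetical: one backprop pass, returns mean squared error

bool trainWithRestarts(int maxRestarts, int maxEpochs, double targetError)
{
    for (int restart = 0; restart < maxRestarts; ++restart)
    {
        std::srand(static_cast<unsigned>(std::time(NULL)) + restart);
        initNetwork();  // fresh random initialization for this attempt

        for (int epoch = 0; epoch < maxEpochs; ++epoch)
        {
            if (trainEpoch() < targetError)
                return true;  // converged on this attempt
        }
        // Stuck, probably in a local optimum; try again from a new start.
    }
    return false;  // no attempt converged
}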

Also, consider that "convergence" of the training process is likely neither necessary nor even desirable: by the time the training process quits, you have likely overfit the data. Better results can be had through early stopping or by constraining the number of hidden nodes in the model.
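
In sketch form, early stopping amounts to this (all four helpers are hypothetical; and with only four XOR patterns there is no real held-out data, so this applies to real datasets):

double trainEpoch();       // hypothetical: one backprop pass over the training set
double validationError();  // hypothetical: error on a held-out validation set
void saveWeights();        // hypothetical: snapshot the current weights
void restoreWeights();     // hypothetical: roll back to the saved snapshot

void trainWithEarlyStopping(int maxEpochs, int patience)
{
    double bestValError = 1e30;
    int epochsSinceImprovement = 0;

    for (int epoch = 0; epoch < maxEpochs; ++epoch)
    {
        trainEpoch();
        const double valError = validationError();

        if (valError < bestValError)
        {
            bestValError = valError;
            saveWeights();                  // remember the best model so far
            epochsSinceImprovement = 0;
        }
        else if (++epochsSinceImprovement >= patience)
        {
            break;                          // validation error stopped improving
        }
    }
    restoreWeights();  // keep the best model seen, not the last one
}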
Quote: Original post by Predictor
[...sensible stuff that I agree with completely...]

Also, consider that "convergence" of the training process is likely neither necessary nor even desirable: by the time the training process quits, you have likely overfit the data. Better results can be had through early stopping or by constraining the number of hidden nodes in the model.


I have never been very convinced by early stopping. I fully appreciate how much of a danger overfitting is, and the natural solution to me seems to be to reduce the number of free parameters that make up the model (i.e., fewer hidden nodes). If the model has more parameters than the data warrants, won't early stopping give us a function that still has too many "wrinkles", except now they are random instead of overfit?

Have you had good experiences using early stopping? Perhaps there is some way of looking at it that I am missing?
I can confirm what everyone else said. It's been shown that XOR has a local minimum, and I think something like a quarter of the starting configurations will get stuck, depending on your momentum.

Note that if you use 2-3-1 as the network configuration, the local minimum disappears, if I remember correctly...


That's one reason why people try XOR: it's a trivial but sufficiently interesting problem to solve!


Quote: Original post by alexjc
Note that if you use 2-3-1 as the network configuration, the local minimum disappears, if I remember correctly...

Apparently not.

Thanks for the answers! I understand better now what is happening :D
Alvaro,

Interesting paper, thanks! I'm glad it takes a whole research paper to show I'm wrong :-)

The statistical approach of checking how often a 2-3-1 network manages to train on XOR is actually pretty informative. I have yet to see it fail, despite what the paper says in theory. That said, in practice I'd use RPROP as much as possible and avoid these problems!
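
For reference, the per-weight update in the iRPROP- variant looks roughly like this (a sketch using the usual published constants; the struct and function names are my own, not from any particular library):

#include <algorithm>

struct RpropState
{
    double stepSize = 0.1;  // per-weight step size (Delta)
    double prevGrad = 0.0;  // gradient remembered from the previous epoch
};

void rpropUpdate(double& weight, double grad, RpropState& s)
{
    const double etaPlus = 1.2, etaMinus = 0.5;
    const double stepMax = 50.0, stepMin = 1e-6;

    if (s.prevGrad * grad > 0.0)
        s.stepSize = std::min(s.stepSize * etaPlus, stepMax);   // same sign: accelerate
    else if (s.prevGrad * grad < 0.0)
    {
        s.stepSize = std::max(s.stepSize * etaMinus, stepMin);  // sign flip: back off
        grad = 0.0;  // iRPROP-: skip this update and forget the gradient
    }

    if (grad > 0.0)      weight -= s.stepSize;  // move against the gradient
    else if (grad < 0.0) weight += s.stepSize;
    s.prevGrad = grad;
}

Only the sign of the gradient is used, so each weight's step size adapts independently of the gradient's magnitude, which is what makes RPROP so robust to the saturation problems discussed above.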


Quote: Original post by alvaro
I have never been very convinced by early stopping. [...] Have you had good experiences using early stopping? Perhaps there is some way of looking at it that I am missing?


I understand intuitively what you're saying, but all I can tell you is that I have had good experiences with early stopping, and I know several authors who suggest it as well. The flip side, of course, is that exploring the effect of the number of hidden nodes takes more compute time than having "too many" and stopping early, though I suppose that might not be too bad, especially on today's hardware.
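
The sweep I have in mind is roughly this (buildAndTrain() and validationError() are hypothetical stand-ins for your own training and evaluation code):

void buildAndTrain(int hiddenNodes);  // hypothetical: train a 2-h-1 network
double validationError();             // hypothetical: error on held-out data

int pickHiddenLayerSize(int maxHidden)
{
    int bestSize = 1;
    double bestError = 1e30;

    for (int h = 1; h <= maxHidden; ++h)
    {
        buildAndTrain(h);
        const double err = validationError();
        if (err < bestError)  // strict <, so smaller nets win ties
        {
            bestError = err;
            bestSize = h;
        }
    }
    return bestSize;
}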

