
Backpropagation and tanh

Started by October 21, 2004 04:22 PM
3 comments, last by smilydon 20 years, 1 month ago
Hi. I have a 1-hidden-layer NN with 11 inputs and 1 output. All the neurons use the sigmoid function. What I really want to do is use the tanh activation function for the output layer, as the outputs, in reality, range over +/-1.

With the sigmoid function, and some preprocessing of the data to fit the 0-1 range, my net trains in around 35000 epochs. If I then change the output layer to use the tanh function (and use the original unscaled data), the net trains for about 5000 epochs and then the output very suddenly saturates, which wrecks the training.

All weights are initialised +/-1, though I did change those to smaller values after reading one of the links off fup's ai-junkie site. Still, the output saturated.

Does anyone have any idea what I might be doing wrong? Am I trying something stupid? Any ideas greatly appreciated.

Ta
M
Why are you preprocessing the data to fit the 0-1 range specifically? That is not necessary; you only need to scale the data so that it is all in the same range, whether that range is 0-1, -500 to 500, or 100 to 200 (it's usually better centered around zero, though).

The weights should not be initialized to exactly +/-1. They should be initialized to small random values; a good range is -1 < n < 1.

I have never found it necessary to use the tanh activation function myself but I have read that in many cases, when training using backprop or similar, you'll get faster convergence.

These links may provide some interesting reading:

http://groups.google.com/groups?q=tanh%20activation&hl=en&lr=&sa=N&tab=wg
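
If it helps, here is a rough, untested Java-ish sketch of what I mean by rescaling a column of data to a common range centered on zero (the numbers in 'raw' are made up):

// Untested sketch: rescale one input column to [-1, 1]; the data values are made up.
class ScaleSketch {
    public static void main(String[] args) {
        double[] raw = { 3.0, 7.5, -2.0, 11.0 };
        double min = raw[0], max = raw[0];
        for (double v : raw) { min = Math.min(min, v); max = Math.max(max, v); }

        double[] scaled = new double[raw.length];
        for (int i = 0; i < raw.length; i++)
            scaled[i] = 2.0 * (raw[i] - min) / (max - min) - 1.0;   // now centered on zero

        System.out.println(java.util.Arrays.toString(scaled));
    }
}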
Thanks for that fup. My weights are initialised in the manner you suggest; my explanation was not clear. Only the output is preprocessed to be scaled to 0-1, as that is the range of the sigmoid function. The actual output values range over -1 < n < +1, hence the thought that the tanh function may be better as it is, as you say, centered around 0 with an SD of 1.

I will have a read of the google groups you suggest.

Ta

M
Hi

I have sorted out the issue. The problem is that the error for both output and hidden neurons is calculated differently depending on the type of activation function used. Basically, the error is based on the derivative of the activation function.

E.g.

For the sigmoid function, output = 1.0 / (1.0 + Math.exp(-1.0 * neuronIp))
derivative = output * (1 - output)

For the tanh function, output = (1 - Math.exp(-2.0 * neuronIp)) / (1 + Math.exp(-2.0 * neuronIp))
derivative = 1 - Math.pow(output, 2)

(Note that both derivatives are written in terms of the neuron's output, not its net input.)

When calculating errors for output neurons
error = (expected op - actual op) * derivative(actual op)

and for hidden neurons
error = (sum of each output neuron's error * the connecting weight) * derivative(neuron.lastActivationValue())

The above code is Java(ish) but should be easily readable by C/C++ people.
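
For anyone who wants something they can actually compile, here is a rough Java sketch of the forward pass and the two error calculations, for one sigmoid hidden neuron feeding one tanh output neuron. The input, weight, learning rate and target values are made up, and the names are only illustrative:

// Rough, compilable sketch of the deltas above (made-up numbers throughout).
class BackpropSketch {
    static double sigmoid(double x)    { return 1.0 / (1.0 + Math.exp(-x)); }
    static double tanh(double x)       { return (1.0 - Math.exp(-2.0 * x)) / (1.0 + Math.exp(-2.0 * x)); }
    static double dSigmoid(double out) { return out * (1.0 - out); }   // derivative in terms of the output
    static double dTanh(double out)    { return 1.0 - out * out; }     // derivative in terms of the output

    public static void main(String[] args) {
        double learningRate = 0.1, target = 0.8;

        // forward pass
        double hiddenOut = sigmoid(0.5);                       // hidden neuron's activation
        double wHiddenToOutput = 0.3;
        double outputOut = tanh(wHiddenToOutput * hiddenOut);  // output neuron's activation

        // backward pass: output error, then hidden error, then the weight update
        double outputError = (target - outputOut) * dTanh(outputOut);
        double hiddenError = outputError * wHiddenToOutput * dSigmoid(hiddenOut);

        wHiddenToOutput += learningRate * outputError * hiddenOut;
        System.out.println(outputError + " " + hiddenError + " " + wHiddenToOutput);
    }
}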

My net has 11 inputs, 20 hidden layer neurons and 1 output neuron. If I use the sigmoid function throughout, it trains in around 60000 epochs. If I use the tanh function it only takes around 10000.

Hope this helps anyone interested. For more details, I found this page very useful: http://www.philbrierley.com/main.html?code/index.html&code/codeleft.html

Ta

M
Quote: Original post by fup
Why are you preprocessing the data to fit the 0-1 range specifically? That is not necessary; you only need to scale the data so that it is all in the same range, whether that range is 0-1, -500 to 500, or 100 to 200 (it's usually better centered around zero, though).

The weights should not be initialized to exactly +/-1. They should be initialized to small random values; a good range is -1 < n < 1.

When initializing a network, the only thing you can really do wrong is initializing the weights too high. Backpropagation and higher-order methods rely on the gradient of the sigmoid or tanh function, and this is almost zero far away from the origin, so those weights won't update (or you need a very large learning rate).
A safe way to initialize is to compute the sum of the maximum input values and initialize the weights between -1/sum and +1/sum.
That way the total input of a neuron (the sum of weights times inputs) is guaranteed to be less than one on the first learning pass.
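
As a rough, untested Java sketch of that initialization (the maximum input values below are made up):

// Untested sketch of the -1/sum .. +1/sum initialization described above.
class InitSketch {
    public static void main(String[] args) {
        double[] maxInputs = { 1.0, 1.0, 5.0, 0.5 };   // maximum absolute value of each input (made up)
        double sum = 0.0;
        for (double m : maxInputs) sum += m;

        java.util.Random rng = new java.util.Random();
        double[] weights = new double[maxInputs.length];
        for (int i = 0; i < weights.length; i++)
            weights[i] = (2.0 * rng.nextDouble() - 1.0) / sum;   // uniform in (-1/sum, +1/sum)

        // the neuron's total input is then at most sum * (1/sum) = 1 on the first pass
        System.out.println(java.util.Arrays.toString(weights));
    }
}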

Quote:
I have never found it necessary to use the tanh activation function myself but I have read that in many cases, when training using backprop or similar, you'll get faster convergence.

A sigmoid is usually preferred in a classification task, because it makes it possible to interpret the results in terms of likelihoods. In that case you would also give the output units a sigmoid activation function.
In function approximation a tanh is usually preferred for the hidden units, with a linear activation for the output neurons. When you initialize the weights of the hidden units very small (smaller than I suggest above) you basically start out approximating a linear function; the nonlinearity emerges through learning. For this reason you mostly see tanh activation functions in control (theory) applications and hardly ever a sigmoid.
The faster convergence of tanh is due to its symmetry around the origin, which doubles the number of weight configurations that produce identical results compared to the sigmoid.
(If you multiply all input weights of a neuron by minus one, and also the weight connecting that neuron to the next layer, then for tanh the network output is identical, but for a sigmoid it is not. So for tanh there are twice as many "solutions" available for the learning rule to find.)
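
A quick, untested Java snippet makes the symmetry easy to see (the numbers are made up): flipping the sign of a neuron's input weight and of its outgoing weight leaves the tanh contribution unchanged, but not the sigmoid one.

// Untested check of the sign-flip symmetry: tanh(-x) = -tanh(x), sigmoid(-x) != -sigmoid(x).
class SymmetrySketch {
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    public static void main(String[] args) {
        double in = 0.7, wIn = 0.4, wOut = 1.3;   // made-up input and weights

        double tanhA =  wOut * Math.tanh( wIn * in);
        double tanhB = -wOut * Math.tanh(-wIn * in);     // identical to tanhA

        double sigA  =  wOut * sigmoid( wIn * in);
        double sigB  = -wOut * sigmoid(-wIn * in);       // NOT identical to sigA

        System.out.println(tanhA + " == " + tanhB);
        System.out.println(sigA + " != " + sigB);
    }
}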

This topic is closed to new replies.
