
Two specific questions on ANN

Started by January 16, 2006 04:27 PM
9 comments, last by Timkin 18 years, 10 months ago
Well, the topic says it all, so here they come. (And for the nitpickers, this posting is about multilayered feedforward systems only.)

1.) Which units require that a bias unit be attached to them? All units? Or all units except those in the input layer? Or just the output layer?

2.) How should the bias weights be taught? The general learning formula for differentiable activation functions (as described in e.g. AIMA) requires computing g'(in), where "in" isn't really defined for bias nodes. Do the bias weights actually have to be taught at all? It would seem to me that teaching only the non-bias weights could compensate for the fact that the bias is always -1.

Thanks,
-- Mikko
Quote: Original post by uutee
1.) Which units require that a bias unit be attached to them? All units? Or all units except those in the input layer? Or just the output layer?

Typically any node with an activation threshold has a bias node attached to it.

Quote: 2.) How should the bias weights be taught?

For MLPs you'll find that the bias is traditionally set to 1. Yep... a waste of computational effort! ;)
Sorry to intrude Timkin, but I believe the bias WEIGHT is trained, not the bias input, which is typically 1 like you said. From my experience, I usually use a bias input for only the input layer and that is sufficient. The reason we use a bias input is so that the hyperplane (the input sum) w1*x1 + w2*x2 + ... = -wn*bias does not have to intersect the origin.
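
To make that concrete, here is a tiny sketch (Python, with made-up weights, so purely an illustration) of a single threshold unit with a bias input fixed to 1. Without the bias weight the decision boundary w1*x1 + w2*x2 = 0 has to pass through the origin; with it, the boundary shifts, which is what lets a single unit learn something like AND:

# Single threshold unit with a bias input fixed to 1 (weights are illustrative only).
def unit_output(x1, x2, w1, w2, w_bias, bias=1.0):
    s = w1 * x1 + w2 * x2 + w_bias * bias   # weighted sum, including the bias weight
    return 1.0 if s >= 0.0 else 0.0         # threshold activation

# The boundary is now w1*x1 + w2*x2 = -w_bias instead of passing through (0, 0).
print(unit_output(1.0, 1.0, 1.0, 1.0, -1.5))  # 1.0  (AND of two "true" inputs)
print(unit_output(1.0, 0.0, 1.0, 1.0, -1.5))  # 0.0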

Assuming you are using backpropagation to optimize your neural network, you still have to compute the derivative evaluated at the SUM even if it is associated with the bias weight.

Man, I really need to make an example to show the derivation. It seems the same questions are being asked over and over.
Thank you for your kind answers, gentlemen

A little "battlefield report": I'm using the formula

(1.) delta(j) = g'(in(j)) * sum(i, W(j,i)*delta(i) )

to compute deltas at non-output layers, and

(2.) W(j,i) <- W(j,i) + learning_const * activation(j) * delta(i)

for updating the weights.

Clearly formula (1.) can't be applied to bias nodes, because it includes the term in(j), which isn't defined for bias nodes. But I noticed that the delta for a bias node never needs to be computed at all: a bias node (by definition) has no incoming weights, so nothing would ever use its delta in an update via formula (2.)

Still, it seems, as Georgia put it, that the bias weights have to be trained using the formula (2.) as usual.
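
To check my understanding, here is a rough Python sketch of how I'm implementing formulas (1.) and (2.) for one hidden layer with sigmoid activations, treating the bias as an extra input fixed to 1. The bias weights are updated by formula (2.) exactly like any other weight (the activation of the bias node is simply 1), but no delta is ever computed for a bias node. The layer sizes and learning rate are just placeholders:

import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

n_in, n_hid, n_out = 2, 2, 1
# One weight row per unit; the last entry of each row is the bias weight.
W_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(n_hid + 1)] for _ in range(n_out)]

def train_one(x, target, lr=0.5):
    # Forward pass: append the constant bias input 1.0 to each layer's activations.
    a_in = list(x) + [1.0]
    in_hid = [sum(w * a for w, a in zip(row, a_in)) for row in W_hid]
    a_hid = [sigmoid(s) for s in in_hid] + [1.0]
    in_out = [sum(w * a for w, a in zip(row, a_hid)) for row in W_out]
    a_out = [sigmoid(s) for s in in_out]

    # Output deltas: g'(in) * (target - activation); for the sigmoid, g'(in) = a*(1 - a).
    d_out = [a * (1.0 - a) * (t - a) for a, t in zip(a_out, target)]
    # Hidden deltas, formula (1.): delta(j) = g'(in(j)) * sum(i, W(j,i)*delta(i)).
    # No delta is computed for the bias "node" -- it has no incoming weights.
    d_hid = [a_hid[j] * (1.0 - a_hid[j]) *
             sum(W_out[i][j] * d_out[i] for i in range(n_out))
             for j in range(n_hid)]

    # Weight updates, formula (2.): W(j,i) <- W(j,i) + lr * activation(j) * delta(i).
    # The bias weights are updated too; the "activation" of the bias node is just 1.0.
    for i in range(n_out):
        for j in range(n_hid + 1):
            W_out[i][j] += lr * a_hid[j] * d_out[i]
    for i in range(n_hid):
        for j in range(n_in + 1):
            W_hid[i][j] += lr * a_in[j] * d_hid[i]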

>>From my experience, I usually use a bias input for only the input layer
>>and that is sufficient.

Thanks - I'll try that. At the moment I'm attaching an individual bias input to each hidden/output unit in the system. The theoretical justification I thought of is that only this way can individual units freely compute linearly separable functions (e.g. AND) on their own. But of course this is just intuition and might be wrong.

-- Mikko
I'm making a little tutorial right now. Give me a few minutes.
OK, here is a little tutorial I made on how to determine the weight update equations for a particular neural network from scratch. It might help? It may confuse? I don't know. Hopefully I didn't make any errors, and it will give you an idea of how to determine these equations for other neural networks.

Click me for your salvation

Well thank you mister Georgia, that was very helpful

There are certain differences between your approach and the one in AIMA. A minor difference is that AIMA uses a -1 bias whereas you use 1 (but the weight will drift in the opposite direction, so it doesn't matter).

A slightly bigger difference is that you use a linear activation for the output layer whereas AIMA uses a nonlinear activation (for all layers except the input layer, whose "activation" is just the input value). I guess your approach is better because it allows choosing the target vector more "naturally". It might also have better theoretical convergence properties (intuition suggests the linear case converges better than the nonlinear one). I'll try it out.
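
If I've followed the derivation correctly, the only piece that changes between the two choices is the output-layer delta, roughly like this:

def output_delta(activation, target, linear_output=True):
    # Linear output unit (your tutorial): g(in) = in, so g'(in) = 1 and the delta is just the error.
    if linear_output:
        return target - activation
    # Sigmoid output unit (AIMA style): g'(in) = a * (1 - a).
    return activation * (1.0 - activation) * (target - activation)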

A second "battlefield report": I've been trying to teach my network the XOR function. At the moment it works, but it seems depressingly dependant on something as simple as initial weights: some randomization helps convergence enormously, but I wonder what are good "general" initial weights would be. At the moment I've left the variance user-configurable.

The Wikipedia article says this is normal for ANNs, though. After all, backpropagation is "just" gradient descent, and is therefore susceptible to all its problems.
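
For reference, the initialization I'm using is roughly the following; the 1/sqrt(fan_in) scaling is only a heuristic I've seen mentioned elsewhere, not something from AIMA, so take it with a grain of salt:

import math
import random

def init_weights(n_inputs, scale=None):
    # Small symmetric random weights, plus one extra weight for the bias input.
    # Scaling by 1/sqrt(fan_in) keeps the initial weighted sums in the steep
    # part of the sigmoid; the scale stays user-configurable, as before.
    if scale is None:
        scale = 1.0 / math.sqrt(n_inputs + 1)
    return [random.uniform(-scale, scale) for _ in range(n_inputs + 1)]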

Thank you once more again,
-- Mikko
Not a problem. Glad to be of service. :)
If they use some differentiable non-linear output function, you can go through the derivation again, I'm sure. I like to use a linear output because I don't want to have to scale my outputs to train and then "unscale" to get the output when I'm using the network.
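
Just to spell out what I mean by scaling: with a sigmoid output you have to squash your targets into the 0..1 range before training and map the network's output back afterwards, something like this (y_min and y_max are whatever range your data happens to use):

def scale(y, y_min, y_max):
    # Squash a raw target into the 0..1 range so a sigmoid output unit can reach it.
    return (y - y_min) / (y_max - y_min)

def unscale(y_scaled, y_min, y_max):
    # Map the network's sigmoid output back into the original range.
    return y_min + y_scaled * (y_max - y_min)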

Yeah, there is always the problem of falling into a local minimum. This is why people like to use genetic algorithms instead of gradient descent, even though genetic algorithms can't guarantee a global minimum over a finite number of training sessions.

Edit: Another nice thing about the GA method is that the requirement of a differentiable activation function disappears.
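
A bare-bones sketch of the idea (mutation only, so strictly speaking it's more of an evolutionary hill-climber than a full GA with crossover; the fitness function is whatever score you assign to a flattened vector of network weights):

import random

def evolve_weights(fitness, n_weights, pop_size=20, generations=100, mut_std=0.1):
    # Evolutionary search over weight vectors: no derivative of the activation
    # function is needed, only a fitness score for each candidate network.
    pop = [[random.uniform(-1.0, 1.0) for _ in range(n_weights)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)            # best candidates first
        survivors = pop[:pop_size // 2]                # keep the top half
        children = [[w + random.gauss(0.0, mut_std) for w in random.choice(survivors)]
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children                     # refill with mutated copies
    return max(pop, key=fitness)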

[Edited by - NickGeorgia on January 17, 2006 4:02:22 AM]
Quote: Original post by NickGeorgia
Sorry to intrude Timkin, but I believe the bias WEIGHT is trained.


Thanks for the correction/clarification Nick. I've been using architectures lately that don't rely on an input weight matrix, so this context of using bias weights eluded me when I wrote my reply. Thanks for picking up on it.

Timkin
Glad to help Timkin. I saw in another post you were working on Autonomous Vehicles. I bet you're swamped with different architectures. :)

