network error space
Hello,
I'm trying to understand the training process for neural networks and I have a question:
In a network where the delta rule is used to update the weights, does each input pattern produce its own error surface?
That is, I understand that an input is presented to the network, the output is compared to the desired output, giving an error. A graph of the error as a function of the weights can be produced. To train, the weights are changed, working their way down the error slopes to a minimum. But I don't understand whether each input pattern produces its own error surface or whether it is the same one. I'm probably missing something simple, but I've hit a mental block!
Thanks all.
I'm not completely familiar with the terminology you are using, but each input/output pair affects the network differently. So I'd say each input produces its own "error surface". The weights are adjusted a small amount after each comparison, and then it goes on to the next training vector. Then repeat that step a lot. At least that's how most of mine work.
I think the Wikipedia article on stochastic gradient descent is relevant. It describes how approximating the true gradient using the gradients of single samples can be more efficient. It's kind of an optimization, but it does change the conditions under which the network will converge (in some complicated mathematical way that I'm not familiar with). You can also use the true gradient by presenting all the pairs, accumulating the weight changes, and then performing the weight changes all at once.
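To make the two modes concrete, here's a rough sketch (my own toy Python/numpy code, using a single linear unit and squared error, so treat it as illustrative rather than definitive). The online loop takes a step on a different per-pattern error surface after every comparison; the batch loop accumulates the per-pattern gradients and takes one step on the error surface of the whole training set.

```python
import numpy as np

# Toy data: four input patterns with targets a linear unit can actually fit.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = X @ np.array([1.0, 2.0])  # targets generated by a known linear rule

eta = 0.1  # learning rate
rng = np.random.default_rng(0)

# Online (stochastic) mode: update after every single pattern.
w = rng.normal(size=2)
for epoch in range(200):
    for xi, yi in zip(X, y):
        err = xi @ w - yi
        w -= eta * err * xi          # step on THIS pattern's error surface

# Batch mode: accumulate over all patterns, then update once per epoch.
w = rng.normal(size=2)
for epoch in range(200):
    grad = np.zeros(2)
    for xi, yi in zip(X, y):
        grad += (xi @ w - yi) * xi   # gradient contribution of one pattern
    w -= eta * grad / len(X)         # step on the WHOLE set's error surface
```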
Anyway, this really isn't that simple. I hadn't heard the term "stochastic gradient descent" until I went looking for an answer to this question, and a lot of people who use this strategy for training neural networks are probably not familiar with it either.
Stochastic gradient descent is just a fancy name for what you would know as sequential (or online) learning (as opposed to batch/offline learning). It is most often used when the training data has a sequential ordering, as in the case of a time parameterisation of the data set. The premise is that a sequence of training pairs generates a sample path of a stochastic process in the parameter space. Just as the value of a stochastic process at any time is a random variable, so too a given training instance generates a sample error from the distribution of errors that the full training domain would generate at that point in the parameter space.
If we presented just one training instance to the network we could trivially optimise the weights to reduce the associated error to zero. Instead though, we usually have a set of training instances and our aim is to minimise the error along the sample path of the stochastic error process. If one such path conveyed all of the information of that stochastic process, then optimising according to this sample would give us a globally optimal network for all possible sample paths. Unfortunately, this is never the case and as such, we must consider many such sample paths (many sequences of training data). Of course, then we come across the problem of which order to present them to the network (sequence 1 first then sequence 2, or vice versa).
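To illustrate the first point above, that a single training instance is trivially fit, here's a minimal sketch of my own (Python/numpy, one linear unit; not from the post itself): repeated delta-rule steps on one pair drive that pair's error to zero, which says nothing about the rest of the sample path.

```python
import numpy as np

x = np.array([1.0, 2.0])   # one input pattern
t = 3.0                    # its target
w = np.zeros(2)
eta = 0.1

for _ in range(100):
    err = x @ w - t
    w -= eta * err * x     # delta rule on the single pair

print(x @ w)  # ~3.0: the error on this one instance is (near) zero
```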
This problem arises most often when trying to learn dynamic systems models using non-recurrent architectures.
It's a fascinating field of research/application, but not one for the mathematically faint-hearted! ;)
Cheers,
Timkin
How does presenting training data in a natural order help a non-recurrent network learn the system? As far as I know, even if such a network doesn't propagate all signals instantly, so as to allow previous inputs to affect future outputs, it could be replaced with an equivalent network that does propagate all signals instantly (and takes extra sets of inputs to represent previous inputs from the sequence instead). The resulting network isn't adjusted after the training is over, so how can there be an advantage in using the natural order of the data instead of an arbitrary order? I just don't see where the advantage is, unless the network is using special elements whose values depend on all previous inputs, which would seem like sort of dodging the "non-recurrent" thing to me.
Quote: Original post by Vorpy
How does presenting training data in a natural order help a non-recurrent network learn the system?
If there is a natural order to the data, this is presumably because the data was generated in that order by the system it was derived from. For example, temperature data at a given point in a room. If you present that data to a network in, say, a time-reversed order, what you're asking the network to do is learn a different function.
Quote: As far as I know, even if such a network doesn't propagate all signals instantly, so as to allow previous inputs to affect future outputs, it could be replaced with an equivalent network that does propagate all signals instantly (and takes extra sets of inputs to represent previous inputs from the sequence instead).
I'm not sure what you're getting at with this sentence, but it sounds like you're saying you could construct a recurrent network by ensuring that the output at time t is a function of the outputs (and/or inputs) at times t-1, t-2, ..., t-k. This is equivalent to an ARMA (auto-regressive moving average) model. Yes, of course you can do it, but we weren't talking about recurrent models.
Think of the problem this way. I present you with a single trace of data generated by my black-box dynamic system and I ask you to write down for me the differential equation(s) that generated it, using only the ordered pairs (t_i, x_i), i = 0, 1, 2, ..., n. If you jumble those data up, under what conditions do you expect (if ever) to end up with the same differential equation?
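For anyone trying to picture the non-recurrent setup being discussed, here's a hypothetical sketch of my own (Python/numpy; the helper name and window length are made up) of feeding a feedforward network lagged samples of a time series. The input vectors only mean anything because the series is in its natural order; jumbling the data first would hand the network pairs generated by a different process entirely.

```python
import numpy as np

def make_lagged_pairs(series, k):
    """Build (input, target) pairs where each input is the k previous
    samples and the target is the next sample in the series."""
    X, y = [], []
    for t in range(k, len(series)):
        X.append(series[t - k:t])  # x(t-k), ..., x(t-1)
        y.append(series[t])        # predict x(t)
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0, 10, 200))  # a toy time series
X, y = make_lagged_pairs(series, k=3)
# Each row of X only makes sense because 'series' is ordered in time;
# shuffling 'series' before windowing would destroy the very lag
# structure the input vectors are meant to encode.
```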
Hi,
Thanks for your help. I think I've got it now: online learning produces an error surface for each training pattern, whilst batch produces an 'error surface' for the entire set.
I have a further question. The delta rule says that the change in a weight is given as Δw = -η(∂E/∂w). Where there is more than one output, is E defined as the SUM of the squares of the errors for each output (or whatever error/cost measurement is used), or is each output given its own individual error 'to work on'? The latter makes more sense to me...
Cheers again.
dlr21
Quote: Original post by dlr21
Where there is more than one output, is E defined as the SUM of the squares of the errors for each output (or whatever error/cost measurement is used), or is each output given its own individual error 'to work on'? The latter makes more sense to me...
The weights in the output layer can be modified on a per-channel basis (i.e., each output is a signal channel that can be optimised independently of the others). There are some obvious exceptions to this, but they are probably not relevant to your application (unless you're dealing with a neurocontrol application).
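To see why the two readings agree for the output layer, here's a small sketch of my own (Python/numpy, assuming a linear output layer and the summed squared error E = ½ Σ_j (y_j − t_j)²; not from the post above): the derivative of the summed E with respect to a weight into output j involves only channel j's error, so using the summed error and giving each output "its own error to work on" produce the same weight update.

```python
import numpy as np

rng = np.random.default_rng(1)
h = rng.normal(size=4)        # hidden-layer activations
t = np.array([1.0, -1.0])     # targets for two output units
W = rng.normal(size=(2, 4))   # output-layer weights
eta = 0.05

y = W @ h                     # linear outputs
err = y - t                   # per-channel errors
# E = 0.5 * sum_j err[j]**2, so dE/dW[j, i] = err[j] * h[i]:
# the weight into output j only "sees" output j's own error.
W -= eta * np.outer(err, h)
```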