
Cross validation - what is it good for?

Started by June 17, 2007 11:57 AM
3 comments, last by Rdan 17 years, 4 months ago
Hi! I have a neural network and I need to use cross validation. I understand that I'm supposed to take subsets from the training data and... what? Do I need to train my net with those subsets or do I test it with them? Why would I do that? thanks [smile]
When you searched Google for "cross-validation" before posting here (because I KNOW you aren't the sort of person who would waste people's time asking questions that could be answered with a five second google search) what did you find insufficient about the information there?
Cross validation is used to see how well a model performs, i.e. how well it is able to generalize to a sample of testing examples.

Basically, you select a (random) subset of your training examples, build a model from that subset, and then classify the unselected examples from your training set. You repeat this step many times, and averaging the results gives you "the" error of your model.
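A minimal sketch of that repeated subset-and-test loop in Python (the mean-predictor "model" here is a hypothetical stand-in for your neural net; `train_fn` and `error_fn` are placeholder names, not from any library):

```python
import random

def cross_validate(examples, train_fn, error_fn, test_fraction=0.3, repeats=10):
    """Repeated random sub-sampling: train on a random subset,
    measure error on the held-out remainder, and average."""
    errors = []
    for _ in range(repeats):
        shuffled = examples[:]
        random.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_fraction))
        train, test = shuffled[:cut], shuffled[cut:]
        model = train_fn(train)
        errors.append(error_fn(model, test))
    return sum(errors) / len(errors)

# Toy demo: the "model" is just the mean of the training targets,
# and the error is mean squared error on the held-out examples.
def train_mean(train):
    return sum(y for _, y in train) / len(train)

def mse(model, test):
    return sum((y - model) ** 2 for _, y in test) / len(test)

random.seed(0)
data = [(x, 2 * x + random.gauss(0, 0.1)) for x in range(20)]
avg_error = cross_validate(data, train_mean, mse)
print(avg_error)
```

In practice you would plug your network's training and evaluation routines in as `train_fn` and `error_fn`.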

And don't forget that the error estimate this gives you still comes from the training data, so it assumes your training set was "a good" (representative) sample of your problem :)
Cross validation is important because after a certain amount of training your model becomes overfitted to the data. If you have a training set and a test set and graph the errors for both as training increases, you observe that at the beginning both errors go down, but there is a point at which the training set error keeps going down while the test set error starts going up.

The data you are training on is a sample from the population. If you train until you have no error on the sample, your model is not likely to generalize. It's like the difference between rote learning and learning to recognize patterns. It's possible to have a model that is capable of repeating the answers to anything on the training set, and so has no error on the training set, but very high error on everything else. Stopping training early protects against this over-specificity to the training set. Related ideas are penalty functions for large weights and heuristics that keep the number of hidden layers and units conservative. Another way to think about it is that there is noise in the data, and reaching zero error on the training set is only possible by encoding information about that noise.
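The early-stopping idea can be sketched in a few lines. This is a toy example, not your network: it fits a single weight by stochastic gradient descent, watches error on a held-out validation set, and keeps the weight that generalized best (the function name and parameters are illustrative):

```python
import random

def train_with_early_stopping(train, valid, lr=0.05, max_epochs=200, patience=5):
    """Fit y ~ w*x by stochastic gradient descent, keeping the weight that
    did best on a held-out validation set and stopping once validation
    error has failed to improve for `patience` epochs in a row."""
    w = 0.0
    best_w, best_err, bad_epochs = w, float("inf"), 0
    for _ in range(max_epochs):
        for x, y in train:
            w += lr * (y - w * x) * x      # SGD step on squared error
        val_err = sum((y - w * x) ** 2 for x, y in valid) / len(valid)
        if val_err < best_err:
            best_w, best_err, bad_epochs = w, val_err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:     # validation error stopped improving
                break
    return best_w

random.seed(0)
# Toy data drawn from y = 3x + noise; the true slope is 3.
xs = [random.random() for _ in range(60)]
points = [(x, 3 * x + random.gauss(0, 0.1)) for x in xs]
w = train_with_early_stopping(points[:40], points[40:])
print(w)
```

The same pattern applies to a real net: checkpoint the weights whenever validation error improves, and stop when it hasn't improved for a while.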
When a clustering program is created in a supervised situation, it is necessary to be sure that it can also perform in an unsupervised situation; this is where cross-validation is used. In cross-validation, a portion of the data is set aside as training data, leaving the remainder as testing data.
The quality of the program's performance on the testing data reflects how well it would perform in an unsupervised setting.
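That single hold-out split can be sketched like so, assuming (hypothetically) a 1-nearest-neighbour classifier on labelled 1-D points in place of the clustering program:

```python
import random

def nearest_neighbor(train, x):
    # Classify x by the label of its closest training point.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def holdout_accuracy(examples, test_fraction=0.25):
    """Set aside a portion of the labelled data for training and
    report accuracy on the remaining (held-out) portion."""
    shuffled = examples[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    correct = sum(1 for x, label in test if nearest_neighbor(train, x) == label)
    return correct / len(test)

random.seed(1)
# Hypothetical data: points drawn near x are class "a" below 5, "b" above.
data = [(x + random.random(), "a" if x < 5 else "b")
        for x in range(10) for _ in range(4)]
acc = holdout_accuracy(data)
print(acc)
```

The accuracy on the held-out portion is the estimate of how the program would do on data it has never seen.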
