What's the best way to train a backprop ANN?
What's the best way to train an ANN on a two-class problem when the classes are very different in size: class 1 has 10-20 vectors, class 2 has 1000-2000 vectors?
If we present the network one vector from each class in turn, the class 1 vectors will be presented too often and the network will become biased toward class 1, learning class 2 poorly. If we present all of the class 1 + class 2 vectors in each epoch, class 2 will be presented too often and the network might not learn class 1 at all.
Does anyone have experience with this problem?
Highly unbalanced classes can be handled in a number of ways. Here are two that I've had success with.
Perform your epochs with the observations of the smaller class and an equal number of observations from the larger class selected at random. Each epoch will go through the entire smaller class and only a subset of the larger class. It may appear that the result will be a network overtrained on the smaller class, but this hasn't been my experience. This is particularly useful if you want to develop an ensemble of networks: train each member of the ensemble in this way independently and then combine their predictions in a majority vote (or some other) manner. This is very effective in the decision tree world, and I believe the NN world has some good examples as well.
Create a misclassification penalty function weighted such that errors made on the smaller class are more influential than those on the larger class. This is effectively very similar to the option above; however, you can give non-integer penalties, which may help to fine-tune the problem. This imposes another set of parameters to optimize, but you should be able to arrive at the right values through cross-validation.
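A minimal sketch of both ideas in Python. The helper names (`balanced_epochs`, `weighted_error`) and the weight value are illustrative, not from any poster's actual code, and the loss shown is a plain squared error:

```python
import random

def balanced_epochs(small, large, n_epochs):
    """Yield one epoch at a time: the whole small class plus an
    equally sized random subset of the large class."""
    for _ in range(n_epochs):
        subset = random.sample(large, len(small))  # fresh subset each epoch
        epoch = small + subset
        random.shuffle(epoch)  # mix the two classes within the epoch
        yield epoch

def weighted_error(target, output, rare_label, rare_weight=10.0):
    """Squared error where mistakes on the rare class cost more."""
    w = rare_weight if target == rare_label else 1.0
    return w * (target - output) ** 2
```

The second function plugs into whatever error accumulation the backprop loop already does; the only change is the per-example weight.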
I hope this helps...
-Kirk
I haven't read the paper yet, but I just found this article from the new issue of Applied Artificial Intelligence. It might give some clues on how to deal with the problem.
http://hci.ece.upatras.gr/pubs_files/j43_Daskalaki_etal_2006.pdf
EVALUATION OF CLASSIFIERS FOR AN UNEVEN CLASS DISTRIBUTION PROBLEM
[Edited by - kirkd on May 9, 2006 10:22:15 AM]
Of course, on well-separated data none of this matters, since training and testing are perfect with this approach.
I actually do training on random vectors with cross-validation.
I handle data with very highly unbalanced classes (compare, for example, face detection: a huge number of non-face possibilities versus face ones).
Besides being highly unbalanced, the classes are very nonlinear and overlap heavily; groups of the scarce class are scattered throughout the large one, as I determined from SOM clustering.
Per epoch, I use random vectors from the large class equal in number to the small-class vectors, then the large-class vectors are shuffled again for the next epoch, but this indeed results in a bias toward the smaller class. And the more the classes are merged with each other, the greater the bias. The same goes for the validation set, where we stop at a random point of the subset from the larger class.
I've trained it with backprop and with genetic algorithms and observed the same bias in both. Maybe it is just a matter of randomization, since some runs produce better results than others, so I just loop through runs until good results are achieved, which is also tedious until automated.
Creating a separate network for every subset of the larger class is a very tedious deal.
There should be some research in this field; I will have a look at the paper.
The inherent problem here for training any classifier on an unbalanced training set is that members of the smaller class will, in general, appear as outliers from the larger class, even if they are spatially separated from it. Personally, on such problems, I would not use an ANN as a classifier, but would rather use an information-theoretic approach such as MML (the Minimum Message Length criterion). Such approaches are better suited to dealing with skewed statistics during classification.
Cheers,
Timkin
I disagree with Timkin.
I trained a face detection problem with about 2000 face examples and 200,000 non-face backgrounds. Even without validation during training, it produced superior results on the test set once it had learned the problem.
Currently I'm working on medical data, a problem much more complicated than face detection, with too much variability in both classes.
http://hci.ece.upatras.gr/pubs_files/j43_Daskalaki_etal_2006.pdf
EVALUATION OF CLASSIFIERS FOR AN UNEVEN CLASS DISTRIBUTION PROBLEM is a nice paper, but it only addresses validation metrics such as sqrt(Se*Pp) and the F-measure, and distribution changes.
Is there a similar paper on training strategies?
When it comes to training strategies for NNs, you're really looking at steepest descent methods (such as QuickProp, I believe), second-order methods like conjugate gradient, Newtonian minimization, or BFGS, and non-gradient methods like the Nelder-Mead simplex or an evolutionary method. Depending on the structure of your network, you could also investigate cascade correlation and various pruning methods.
To be honest, I don't think the problem will be solved by a different training methodology, but rather an improved description of the accuracy (such as Timkin's suggestion) or data distribution changes. Along those lines, I also believe that you'll be better off with an ensemble rather than a single network. Yes, it will require an extra layer of code/programming to enable such a system, but I think it would be worth it. Just my $0.02.
-Kirk
Quote: Original post by yyyy
I trained a face detection problem with about 2000 face examples and 200,000 non-face backgrounds. Even without validation during training, it produced superior results on the test set once it had learned the problem.
Superior compared to what other techniques? Furthermore, face detection is a very binary problem: an object is usually quite clearly a face or not a face, and there aren't too many things in the world that are almost faces. I've done a lot of work on classification of medical data, both imagery and events in time-series data such as EEG. It is a significantly different problem than face detection, and if I were you I would not be so quick to dismiss other techniques unless you had a prior constraint to solve the task using an ANN. Of course, you're welcome to use whatever technique you like... but all too often I see someone choosing a tool and then looking for a task to throw it at, rather than first considering the features of the problem and choosing the best tool.
Cheers,
Timkin
Well, the object is not always clearly a face or not, even to a human. Consider a 20x20 rectangle, wavelet-downscaled from the original picture, with only grayscale values; then add different illumination (side, top, bottom, combined, room, street, etc.) or complete pitch darkness (where even a human struggles to see the face), bearded faces, spectacles, different skin tones, different face morphologies, etc. It is easy to train a detector for a couple of specific faces, but not a universal one. It can even outperform a human, who cannot see a dark face in pitch darkness.
I do not dismiss other techniques; the only constraint is that at the moment I have only my own programmed ANN with backprop and GA, plus SOMs. I agree the features are key: data mining to get the classes as far apart as possible, and then every technique will work for sure.
My problem tends to give low Se and high Sp from both supervised and unsupervised ANNs. And if I train with different metrics for validation, I end up with either low Se/high Sp, high Se/low Sp, or a golden middle, though in a few cases it actually yields both high Se and high Sp on the test set.
At first I used a cyclic pass through the small class until all large-class entries had been presented in each epoch, and it seemed to work better than using a per-epoch random equal-sized set of entries from the large class.
It would be nice to try a group of networks, one for every equal-sized set from the large class.
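The cyclic scheme mentioned above (recycle the small class so that every large-class entry is seen once per epoch) could be sketched like this. The helper name is hypothetical, assuming the vectors are held as in-memory lists:

```python
from itertools import cycle

def cyclic_epoch(small, large):
    """Pair every large-class vector with a small-class vector,
    cycling through the small class until the large class is
    exhausted. One epoch presents all large-class entries once
    and repeats the small class as many times as needed."""
    pairs = []
    small_iter = cycle(small)  # loops back to the start forever
    for big in large:
        pairs.append(next(small_iter))
        pairs.append(big)
    return pairs
```

Note this presents the small class far more often per epoch than balanced subsampling does, which is consistent with the bias toward the small class described above.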
Quote:
It would be nice to try a group of networks, one for every equal-sized set from the large class.
I know this would be extra work to put together, but I would bet your performance goes up nicely with an ensemble. I've been working in a similar situation (predicting biologically active chemical compounds) where there are very few positives and a huge number of negatives. The ensemble invariably boosts performance for both classes.
Keep us informed of your progress and whether you decide to implement a different type of classifier, as Timkin suggests.
-Kirk
I actually work with chemists as well, but on medical data.
I've already tried another approach. I had been using the static test set suggested by the medical site for testing, but I suspected that their test data might be distributed differently from the training set they provided. So I combined the training and testing sets and split the result into 50% for training, 25% for validation, and 25% for testing (equal to the size of the test set they provide). I trained the ANN using GA with the sqrt(Se*Sp) metric for validation and fitness evaluation, and got 75% for Se, Sp, and Ac, with the same results on the overall data set.
This geometric metric from that paper turned out to be very valuable for uneven distributions.
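For reference, the geometric-mean metric is straightforward to compute from confusion-matrix counts (a minimal sketch; the function name is my own):

```python
from math import sqrt

def geometric_mean(tp, fn, tn, fp):
    """sqrt(Se * Sp), where Se = TP/(TP+FN) and Sp = TN/(TN+FP).
    Unlike plain accuracy, this drops to zero if either class is
    completely misclassified, which is why it suits uneven
    class distributions."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return sqrt(se * sp)
```

A trivial classifier that always predicts the large class scores 0 on this metric, even though its raw accuracy on a 10:1 data set would be over 90%.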
This modular approach needs some extra thought in the program: I keep each ANN in a separate file, so the larger the large class, the more ANN files are required.
But if the ensemble is trained on separate subsets of the large class and we have 100 networks, only one of them will have been trained on a given region of the training data. If we test on data from that region, it will probably be the only one with the right output, while the other 99 output class 1:

networks: 1, 2, 3, 4, 5, ..., 100
outputs:  0, 0, 0, 1, 0, ..., 0

What's the way to combine their outputs?
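A strict majority vote would drown out the one specialist in exactly the 99-vs-1 scenario described, so averaging the raw outputs, or taking the maximum rare-class response, may be more forgiving. A sketch of the three combiners, assuming each network emits a value in [0, 1] where 1 means the rare class (the function names are illustrative):

```python
def majority_vote(outputs, threshold=0.5):
    """Rare class wins only if more than half the networks say so."""
    votes = sum(1 for o in outputs if o >= threshold)
    return 1 if votes > len(outputs) / 2 else 0

def average_vote(outputs, threshold=0.5):
    """Average the raw outputs, then threshold once."""
    return 1 if sum(outputs) / len(outputs) >= threshold else 0

def max_response(outputs, threshold=0.5):
    """One confident specialist can flag the rare class even if the
    other networks never saw that region of the large class."""
    return 1 if max(outputs) >= threshold else 0
```

The max rule raises sensitivity at the cost of specificity (any one false alarm triggers it), so the choice depends on which error is more costly for the medical data.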