Decision Tree
Hey everyone, it's been a while since I posted... but I was wondering if anyone could tell me what their approach would be to building a really optimized decision tree. I learned about them in an AI class I took in college, but I'm still not sure I could build a robust one on my own. Anyway, any input or links would be great.
#1 Belmont, Richter
There aren't that many ways to induce a decision tree. I'm not sure if I have the answer you're looking for, but here goes. A couple of definitions first - your training set or dataset is composed of a number of individual observations, and for each observation you have a set of features. Every observation has the same set of features. The features are usually numerical but don't have to be.
The standard method of decision tree induction is a top-down approach known as recursive partitioning. The root node of the tree holds all of your observations. You scan through all possible splits of the data and find the split that maximizes your purity criterion. What this means is that you look at every possible value for every feature and break the observations into two sets. With numerical data, you would essentially put all the observations with a value lower than the current split value into one set and all those greater than or equal to it into a different set. This partitioning of the data is then used to develop two child nodes of the current node. You then repeat the process for each child node using only the observations that partitioned into it. Keep doing this until you hit some stopping criterion, such as a minimum number of observations per node, a maximum depth of the tree, etc.
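If it helps to see it in code, here's a minimal sketch of that splitting step in Python, assuming numerical features and using the fraction of the majority class as the purity score - the data layout and names are just for illustration:

```python
# Minimal sketch of recursive partitioning (illustrative data layout).
# Each observation is (features, label), where features is a tuple of numbers.

def majority_fraction(labels):
    """Purity of a node's labels: fraction belonging to the majority class (1.0 = pure)."""
    return max(labels.count(c) for c in set(labels)) / len(labels)

def best_split(observations):
    """Try every feature/value pair and return (score, feature, value) for the purest split."""
    n = len(observations)
    n_features = len(observations[0][0])
    best = None
    for f in range(n_features):
        for value in {feat[f] for feat, _ in observations}:
            left  = [lab for feat, lab in observations if feat[f] <  value]
            right = [lab for feat, lab in observations if feat[f] >= value]
            if not left or not right:
                continue  # the split must actually divide the data
            score = (len(left) * majority_fraction(left) +
                     len(right) * majority_fraction(right)) / n
            if best is None or score > best[0]:
                best = (score, f, value)
    return best

def grow(observations, depth=0, max_depth=5, min_obs=5):
    """Recursively partition the observations until a stopping criterion is hit."""
    labels = [lab for _, lab in observations]
    split = best_split(observations)
    if depth >= max_depth or len(observations) < min_obs or split is None:
        return {"leaf": max(set(labels), key=labels.count)}  # majority-class leaf
    _, f, value = split
    left  = [o for o in observations if o[0][f] <  value]
    right = [o for o in observations if o[0][f] >= value]
    return {"feature": f, "value": value,
            "left":  grow(left,  depth + 1, max_depth, min_obs),
            "right": grow(right, depth + 1, max_depth, min_obs)}
```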
Purity is a measure of how well the partition groups observations of a similar class together. There are lots of measures of purity - simple accuracy, Gini impurity, Twoing rule, etc. There has actually been research that shows no one purity measure is any better than any other all the time.
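For reference, Gini impurity for the labels that land in a node can be sketched like this (lower is purer; 0 means every observation in the node belongs to the same class):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum over classes of p_c^2, where p_c is each class's fraction."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# Example: a node holding 8 observations of class 'A' and 2 of class 'B'
print(gini_impurity(['A'] * 8 + ['B'] * 2))  # 1 - (0.8^2 + 0.2^2) = 0.32
```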
That is the basic, standard method, so if you have questions on that, I'm more than happy to answer them or to clarify what I've said.
Once you have a tree, there are lots of additional things you can do. One rule of thumb has been to grow your decision tree to a maximal size and then prune it back using a different set of data. This is usually a cross-validation-style process in which you start off by taking a subset of your training set, say 20%, and holding it back while the decision tree is being grown. You develop the decision tree on the 80%, and then you use the held-back 20% to guide you in pruning the tree. This often leads to a tree that performs better on unseen data.
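As a rough sketch, the holdout step amounts to something like this - the 20% figure and helper names are just placeholders, and `predict` stands in for whatever routine walks the tree to classify one observation:

```python
import random

def holdout_split(observations, held_back_fraction=0.2, seed=0):
    """Shuffle the data and hold back a fraction to guide pruning."""
    data = list(observations)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * held_back_fraction)
    return data[cut:], data[:cut]   # (grow_set, held_back_set)

def accuracy(tree, observations, predict):
    """Fraction of observations the tree classifies correctly - used to compare pruned trees."""
    correct = sum(1 for feat, lab in observations if predict(tree, feat) == lab)
    return correct / len(observations)
```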
A method that I developed at my job (and actually got published - woo hoo!) was to use an evolutionary computation method. I start off by generating a population of random trees - the feature used to split at each node and the value used are selected totally at random. Each tree is scored based on its accuracy and size - smaller, more accurate trees are better. I then select parents by tournament selection, in which I choose 4 trees at random from the population and keep the best one as parent #1. I pick 4 more trees at random and keep the best one as parent #2. For crossover, a random node from within each tree is selected and the subtrees are swapped between the two trees. Mutation can also occur, in which a node is selected at random and the value used to split or the feature used to split is changed randomly. Either crossover or mutation gives me 2 child trees, and I repeat this parent selection, crossover/mutation process until I have the same number of children as parents. I then throw out the parents and the children become the new, potential parent population.
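A rough sketch of one generation of that process looks something like the following - `fitness`, `crossover`, and `mutate` are placeholders for the scoring, subtree-swapping, and random-change steps described above, and the crossover/mutation rate is just an example value:

```python
import random

def tournament_select(population, fitness, k=4):
    """Pick k trees at random and keep the best one."""
    return max(random.sample(population, k), key=fitness)

def next_generation(population, fitness, crossover, mutate, p_crossover=0.9):
    """Build a child population the same size as the parent population."""
    children = []
    while len(children) < len(population):
        parent1 = tournament_select(population, fitness)
        parent2 = tournament_select(population, fitness)
        if random.random() < p_crossover:
            child1, child2 = crossover(parent1, parent2)        # swap random subtrees
        else:
            child1, child2 = mutate(parent1), mutate(parent2)   # small random changes
        children.extend([child1, child2])
    return children[:len(population)]   # the parents are thrown out entirely
```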
I also use a cross-validation-style process during the evolution. I hold back a subset of the data and monitor how well the trees are performing on that held-back dataset. As long as accuracy on both the training set and the held-back set keeps improving, I continue the evolution. Usually what happens is that the predictions for the training set get better and better, and eventually the accuracy on the held-back set starts to decline. This tells me that the decision tree is becoming overtrained and is starting to find random relationships in the data. That's my signal to stop the evolutionary process.
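In sketch form, that stopping rule is just early stopping on the held-back accuracy (the `patience` value and helper names here are placeholders):

```python
def evolve(population, next_generation, held_back_accuracy, patience=5):
    """Run generations until accuracy on the held-back set stops improving."""
    best = max(held_back_accuracy(t) for t in population)
    stale = 0
    while stale < patience:
        population = next_generation(population)
        current = max(held_back_accuracy(t) for t in population)
        if current > best:
            best, stale = current, 0   # still generalizing - keep evolving
        else:
            stale += 1                 # held-back accuracy flat or declining - likely overtraining
    return population
```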
My reason for developing this method was that the standard method is a very greedy induction method in which the best possible split is chosen at each node. My argument was that by accepting a split that is not quite as good at one node, I might open up a better choice at a subsequent node further down the tree. Essentially, the standard method optimizes at each split, while my method optimizes the whole tree. In practice, my method has always been able to produce a better tree than the standard recursive partitioning method.
Whew! Long. Sorry about that, but I've been studying this for years and years now, so I get a little passionate. 8^) There are lots more details you can consider - development of ensembles of decision trees is my current interest. Please let me know if you have any questions. I'm more than happy to answer what I can.
-Kirk
Don't worry about getting passionate. I love trees, but I've always been curious as to where to go for more information on decision trees... which makes you almost like a guardian force of knowledge, since I'm without any sort of resource. If you could explain what crossover and mutation are, and those other measures of purity, I think everything would be a bit clearer. I'm still a little fuzzy on the whole process of generating one, but I've gotten a better understanding of it thanks to you. I plan to rip apart the AI engine in this fighter I made a while back with a few people and use a decision tree. However, if I can't figure it out, my second choice will be fuzzy logic. Take care!
#1 Belmont, Richter
Let's start with purity. Remember that at every node, we're looking for a way to partition the observations. Purity gives us a measure of the goodness of that partitioning. The easiest way to think about it is to consider a dataset with 2 classes. At any particular node, you try to find a way to partition the data into the two child nodes such that the observations that wind up in each node are mostly one class. That would be a purity measure that is purely accuracy based. Again, we want to find a partition of the data that leads to the two classes being separated as well as possible.
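A tiny concrete example, using accuracy as the purity measure:

```python
# Suppose a candidate split sends 10 observations to each child node.
left  = ['A'] * 9 + ['B'] * 1   # 9 of class A, 1 of class B -> majority fraction 0.9
right = ['B'] * 8 + ['A'] * 2   # 8 of class B, 2 of class A -> majority fraction 0.8

def majority_fraction(labels):
    return max(labels.count(c) for c in set(labels)) / len(labels)

# Weighted purity of the whole split: (10 * 0.9 + 10 * 0.8) / 20 = 0.85
purity = (len(left) * majority_fraction(left) +
          len(right) * majority_fraction(right)) / (len(left) + len(right))
print(purity)   # 0.85 - the closer to 1.0, the better the split separates the two classes
```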
The other two that you asked about, crossover and mutation, are specific to evolutionary computation methods and do not apply directly to decision trees.
Since I used evolutionary computation to develop my trees, mutation and crossover became important, but for decision trees in general, you don't have to know about them.
In the world of evolutionary computation, crossover is a way in which you basically swap pieces of two things, kind of like crossover in DNA. For decision trees, I just select subsections of each parent tree and swap them. Mutation is much simpler and is just making a small, random change. This could be something as simple as altering the value used for the split or the feature used for the split, or as complex as completely regrowing a whole section of the tree. Again, these two methods are specific to evolutionary computation and are not necessarily important for decision trees in general.
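To make that a little more concrete, here's a very rough sketch of subtree crossover and a simple mutation, assuming trees are stored as nested dicts (internal nodes hold a feature index and split value, leaves hold a class label) - just an illustration, not a full implementation:

```python
import copy
import random

# Assumed tree representation: internal nodes look like
# {"feature": f, "value": v, "left": subtree, "right": subtree}; leaves look like {"leaf": label}.

def internal_nodes(tree, found=None):
    """Collect every internal (splitting) node in the tree."""
    if found is None:
        found = []
    if "leaf" not in tree:
        found.append(tree)
        internal_nodes(tree["left"], found)
        internal_nodes(tree["right"], found)
    return found

def crossover(tree_a, tree_b):
    """Swap one randomly chosen subtree between copies of the two parents."""
    child_a, child_b = copy.deepcopy(tree_a), copy.deepcopy(tree_b)
    node_a = random.choice(internal_nodes(child_a))   # assumes each parent has at least one split
    node_b = random.choice(internal_nodes(child_b))
    saved = copy.deepcopy(node_a)
    node_a.clear(); node_a.update(copy.deepcopy(node_b))   # graft b's subtree into child_a
    node_b.clear(); node_b.update(saved)                   # graft a's subtree into child_b
    return child_a, child_b

def mutate(tree, n_features, value_range=(0.0, 1.0)):
    """Randomly change either the feature or the split value at one node."""
    child = copy.deepcopy(tree)
    nodes = internal_nodes(child)
    if not nodes:                      # tree is a single leaf - nothing to change
        return child
    node = random.choice(nodes)
    if random.random() < 0.5:
        node["feature"] = random.randrange(n_features)
    else:
        node["value"] = random.uniform(*value_range)
    return child
```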
If you're looking for more details, here are some book recommendations:
Pattern Classification by Duda, Hart, and Stork - this is a very good general reference for pattern classification methods. The section on decision trees is short and to the point, but good.
http://www.amazon.com/Pattern-Classification-2nd-Richard-Duda/dp/0471056693/ref=pd_bbs_sr_1/104-9587561-4582350?ie=UTF8&s=books&qid=1182726556&sr=8-1
Classification and Regression Trees by Breiman - One of the definitive references, written by some of the top researchers in the field. Heavy reading in places, but a complete reference.
C4.5: Programs for Machine Learning by Quinlan - Another of the primary researchers of decision trees. Also quite good.
You can also search the web and find good references. Look in particular for things by Breiman and Quinlan and you'll be A-OK.
-Kirk