Representing decision trees as 'genes'
I would have a good look at first order logic. You could also build decision trees from an example set. Both methods and others on algorithmically constructing good decision trees can be found in "Artificial Intelligence a modern approach" by Russell and Norvig.
Quote: Original post by visage
The other possibility was this: for each tree created, parse all sub trees into parts and store them into the 'parts' drawer. For each part in a tree, that parts' weight value would be equal to the aggregate fitness values for all trees it is in.
The concept of a gene within the dna strand could be modelled by using the boolean theorems, and may produce better 'parts' than a purely random approach.
For instance, a section of one tree might read:
"...x OR NOT x AND y..." which could be swapped with:
"...x OR y..." because they are functionally identical.
sunny OR NOT sunny AND hot = sunny OR hot.
Then the concept of parts makes some sense, so swapping can be done by gene-phrases. the "x OR y" gene could be inherited in place, or it could be exchanged for different gene, like maybe "x AND NOT y".
[Edited by - AngleWyrm on July 9, 2007 6:15:36 PM]
--"I'm not at home right now, but" = lights on, but no ones home
I just wanted to pitch in my $0.02 on standard decision tree induction vs evolutionary decision tree induction.
Standard induction is very simple and involves looking for the best possible split at each node. Starting at the root, we ask the question, which split of the data gives me the best purity in the child nodes? Purity here is related to accuracy - how can I best split the data such that the resulting partition gives me the best possible grouping of classes in the child nodes? This is repeated at each node until you reach a termination criteria - not enough observations to split up, tree is maximum depth, the node is pure (all one class), etc.
The problem I have with this is that it is strictly a greedy induction. The question I asked was, what if I choose a suboptimal split at some early point in the tree that gives me a better set of options later in the tree? You could do this by look-ahead techniques, but this becomes very computationally expensive. I chose to use evolutionary methods because I am effectively evaluating the whole tree at once and not just a single split at a time. This allows us to evaluate interactions within the data that may not be obvious from the greedy approach.
-Kirk
Standard induction is very simple and involves looking for the best possible split at each node. Starting at the root, we ask the question, which split of the data gives me the best purity in the child nodes? Purity here is related to accuracy - how can I best split the data such that the resulting partition gives me the best possible grouping of classes in the child nodes? This is repeated at each node until you reach a termination criteria - not enough observations to split up, tree is maximum depth, the node is pure (all one class), etc.
The problem I have with this is that it is strictly a greedy induction. The question I asked was, what if I choose a suboptimal split at some early point in the tree that gives me a better set of options later in the tree? You could do this by look-ahead techniques, but this becomes very computationally expensive. I chose to use evolutionary methods because I am effectively evaluating the whole tree at once and not just a single split at a time. This allows us to evaluate interactions within the data that may not be obvious from the greedy approach.
-Kirk
Thanks for all the info guys! I am currently re-debating my methods, but the decision tree is definitely a real possibility.
Thanks again!
Thanks again!
Quote: Original post by kirkd
I just wanted to pitch in my $0.02 on standard decision tree induction vs evolutionary decision tree induction.
Standard induction is very simple and involves looking for the best possible split at each node. Starting at the root, we ask the question, which split of the data gives me the best purity in the child nodes? -Kirk
That was a long time ago, but I remember studying building decision tree based on information theory (ID3). If I remember correctly, by choosing the split
that maximize the information gain, you tend to build an optimally balanced tree. Optimal in the Occam's razor sense. While it does not always yield the most optimal tree, its a pretty damn good heuristic. Is the information gain (different of entropy) what you defined as "purity", or are you talking about Gini impurity? Im guessing the later.
Unfortunatly, access to your article is protected. Exactly what decision tree classifier did you compare it against? The abstract only mentions "recursive partitioning" which can mean a bunch of algos. Im guessing CART, but did you compare it to ID3, C4.5 or C5? (God, I hate when reviewers ask me those questions).
Also, the abstract only mentions classification accuracy. What about the depth of the produced trees? The whole purpose of using decision trees is to reduce the number of features you have to consider to classify an example...
Forgive my curiousness, Id very much like to read the article.
Steadtler,
Thanks for the interest and the questions. I'm happy to provide answers as best as I can.
I left my original reference to "purity" fairly vague due to the enormous number of possibilities here. Unfortunately, smart folks like you pick up on those things and start asking questions. 8^) Seriously though, the most common purity measures I've seen are accuracy, Gini Impurity, and Twoing Rule. For those not familiar with these, accuracy simply tries to get as many observations of the same class into each single child node. Very straight forward and easy to understand, but in my experience trees trained with accuracy as the splitting rule tend to do poorly on external data. Gini Impurity, if I remember correctly, has the effect of trying to seperate the largest class from all others at each split. Twoing Rule tries to divide the dataset into equal halves with as little class mixing as possible. You also mentioned information gain, which I belive is the rule used in C4.5 and C5 from Quinlan. I've seen various papers that compared splitting rules and it seems the no free lunch theorem rules supreme in that no one splitting rule always performs better than the others across datasets. Gini, Twoing, and Information Gain seem to have emerged as the pick of the litter and I almost never see other options.
For my evolutionary programming method, each tree is given a Minimum Description Length score. This is equivalent to what Quinlan uses in C4.5 for pruning and, without going into too much detail, is a measure of accuracy on the training set and size of the tree. More accurate, smaller trees give lower scores, so I minimize the MDL score.
I had forgotten that you need a subscription to get to my article. Send me a PM with your e-mail address and I'll send you a PDF reprint.
The comparison I used was CART as implemented in S-PLUS v7. I did not do a direct comparison against the others. I made the assumption that the various methods would likely perform similarly. Maybe not a valid assumption, but... Generally speaking, the evolutionary programming method seems to outperform most other methods I've compared it to across all datasets I've worked with, including SVM, NN, k-NN, etc.
Very good point. The way I did my comparisons was to impose different stopping criteria - 5 observations per node, 5% of the training set size, and 10% of the training set size. The CART trees do well when you grow them down to 5 observations per node and then prune back. They do less well when I force the leaves to have no less than 5% of the training set size, and they are almost useless with 10% of the training set size per leaf. Evolutionary programming almost always gets the same accuracy for 5% and 10% stopping criteria, and this is usually better than for any of the CART methods. When I use 10% training set size as the minimum leaf size, the trees tend to have a depth of around 3-5 with about 6-8 splits total.
I should also mention that the accuracy I'm refering to is on an external test set. I did all the original studies using 5-fold cross validation in which I held out a random 20% of the data, induced the trees on the remaining 80%, and then evaluated the accuracy on the 20% held out data. The results are then averaged over the 5-folds. I typically also repeat the 5-fold cross validation 5 or 10 times just to make sure the averages aren't affected by a really bad split of the data. When I do this 5-fold cross-validation repeated 10 times, the average performance is almost identical to the median, and it always beats or ties the other methods on the external prediction sets. At least in my hands, that's how it works. (How's that for an academic reply??) 8^)
This brings up another very interesting point. I looked at feature usage on a large set of trees, and saw some very interesting results. The procedure was to grow a large number of trees by evolutionary programming, say 100. Then, I computed the number of times each feature occurs across the whole set. The idea was that the stochastic nature of evolutionary programming leads to different trees from run to run. But, if some features are truly more important than others, they should be selected more often. From the usage frequencies and the total number of features used, I compute the probability that a feature would occur that many times at random. If I have n features, the expected value is 1/n if features were just being chosen at random. What I end up with is a small set of features that occur so often that the probability of random occurance is effectively 0. Not too surprising, but what I didn't anticipate was that a very large set of features occur zero times. These features were effectively selected against as they did not contribute to small, accurate trees. If I then go back and reduce the dataset to only those that occured frequently, I get another boost of about 2-3%. One dataset I remember specifically had 197 features and only 19 of those occured more often than would be expected from random occurence. Not bad for feature reduction, I think.
And, just to make this post even longer (!!!), I have been working on combining trees into an ensemble or forest. From comparisons across 6 datasets (for which I have access to comparisons with other methods like SVM, NN, k-NN, CART, Random Forest, Stochastic Gradient Boosting) my evolutoinary programming method is usually within 1-2% of the best method, Random Forests. Sometimes a little higher, sometimes a little lower than Random Forests, so effectively the same. The nice part is that my forests tend to be around 15-20 trees, while Random Forest usually needs ~100 or more.
If you're still awake, congratulations! I've been working on this stuff for quite a while now, so I tend to get passionate when people ask questions. If you have the strength, feel free to ask questions.
-Kirk
Thanks for the interest and the questions. I'm happy to provide answers as best as I can.
Quote:
That was a long time ago, but I remember studying building decision tree based on information theory (ID3). If I remember correctly, by choosing the split
that maximize the information gain, you tend to build an optimally balanced tree. Optimal in the Occam's razor sense. While it does not always yield the most optimal tree, its a pretty damn good heuristic. Is the information gain (different of entropy) what you defined as "purity", or are you talking about Gini impurity? Im guessing the later.
I left my original reference to "purity" fairly vague due to the enormous number of possibilities here. Unfortunately, smart folks like you pick up on those things and start asking questions. 8^) Seriously though, the most common purity measures I've seen are accuracy, Gini Impurity, and Twoing Rule. For those not familiar with these, accuracy simply tries to get as many observations of the same class into each single child node. Very straight forward and easy to understand, but in my experience trees trained with accuracy as the splitting rule tend to do poorly on external data. Gini Impurity, if I remember correctly, has the effect of trying to seperate the largest class from all others at each split. Twoing Rule tries to divide the dataset into equal halves with as little class mixing as possible. You also mentioned information gain, which I belive is the rule used in C4.5 and C5 from Quinlan. I've seen various papers that compared splitting rules and it seems the no free lunch theorem rules supreme in that no one splitting rule always performs better than the others across datasets. Gini, Twoing, and Information Gain seem to have emerged as the pick of the litter and I almost never see other options.
For my evolutionary programming method, each tree is given a Minimum Description Length score. This is equivalent to what Quinlan uses in C4.5 for pruning and, without going into too much detail, is a measure of accuracy on the training set and size of the tree. More accurate, smaller trees give lower scores, so I minimize the MDL score.
Quote:
Unfortunatly, access to your article is protected. Exactly what decision tree classifier did you compare it against? The abstract only mentions "recursive partitioning" which can mean a bunch of algos. Im guessing CART, but did you compare it to ID3, C4.5 or C5? (God, I hate when reviewers ask me those questions).
I had forgotten that you need a subscription to get to my article. Send me a PM with your e-mail address and I'll send you a PDF reprint.
The comparison I used was CART as implemented in S-PLUS v7. I did not do a direct comparison against the others. I made the assumption that the various methods would likely perform similarly. Maybe not a valid assumption, but... Generally speaking, the evolutionary programming method seems to outperform most other methods I've compared it to across all datasets I've worked with, including SVM, NN, k-NN, etc.
Quote:
Also, the abstract only mentions classification accuracy. What about the depth of the produced trees?
Very good point. The way I did my comparisons was to impose different stopping criteria - 5 observations per node, 5% of the training set size, and 10% of the training set size. The CART trees do well when you grow them down to 5 observations per node and then prune back. They do less well when I force the leaves to have no less than 5% of the training set size, and they are almost useless with 10% of the training set size per leaf. Evolutionary programming almost always gets the same accuracy for 5% and 10% stopping criteria, and this is usually better than for any of the CART methods. When I use 10% training set size as the minimum leaf size, the trees tend to have a depth of around 3-5 with about 6-8 splits total.
I should also mention that the accuracy I'm refering to is on an external test set. I did all the original studies using 5-fold cross validation in which I held out a random 20% of the data, induced the trees on the remaining 80%, and then evaluated the accuracy on the 20% held out data. The results are then averaged over the 5-folds. I typically also repeat the 5-fold cross validation 5 or 10 times just to make sure the averages aren't affected by a really bad split of the data. When I do this 5-fold cross-validation repeated 10 times, the average performance is almost identical to the median, and it always beats or ties the other methods on the external prediction sets. At least in my hands, that's how it works. (How's that for an academic reply??) 8^)
Quote:
The whole purpose of using decision trees is to reduce the number of features you have to consider to classify an example...
This brings up another very interesting point. I looked at feature usage on a large set of trees, and saw some very interesting results. The procedure was to grow a large number of trees by evolutionary programming, say 100. Then, I computed the number of times each feature occurs across the whole set. The idea was that the stochastic nature of evolutionary programming leads to different trees from run to run. But, if some features are truly more important than others, they should be selected more often. From the usage frequencies and the total number of features used, I compute the probability that a feature would occur that many times at random. If I have n features, the expected value is 1/n if features were just being chosen at random. What I end up with is a small set of features that occur so often that the probability of random occurance is effectively 0. Not too surprising, but what I didn't anticipate was that a very large set of features occur zero times. These features were effectively selected against as they did not contribute to small, accurate trees. If I then go back and reduce the dataset to only those that occured frequently, I get another boost of about 2-3%. One dataset I remember specifically had 197 features and only 19 of those occured more often than would be expected from random occurence. Not bad for feature reduction, I think.
And, just to make this post even longer (!!!), I have been working on combining trees into an ensemble or forest. From comparisons across 6 datasets (for which I have access to comparisons with other methods like SVM, NN, k-NN, CART, Random Forest, Stochastic Gradient Boosting) my evolutoinary programming method is usually within 1-2% of the best method, Random Forests. Sometimes a little higher, sometimes a little lower than Random Forests, so effectively the same. The nice part is that my forests tend to be around 15-20 trees, while Random Forest usually needs ~100 or more.
If you're still awake, congratulations! I've been working on this stuff for quite a while now, so I tend to get passionate when people ask questions. If you have the strength, feel free to ask questions.
-Kirk
Quote: Original post by kirkd
If you're still awake, congratulations! I've been working on this stuff for quite a while now, so I tend to get passionate when people ask questions. If you have the strength, feel free to ask questions.
-Kirk
Im still awake, what with coffee and all. Thanks for the answers, you can soak your bloody fingers in rose perfumed water now :) Looks very interesting, tho I am always sceptical of evolutionary methods (which are essentially a gradient search with a lot of random involved, IMO). Learning methods are a b*tch to study, as people tend not to know very well the methods they compare their own research against*, so thanks a lot, you just made me a better researcher ;)
*Ive seen a paper comparing a NN with a SVM using a -linear- kernel... sad.
SVM with linear kernel vs NN? Wow. That is sad.
I agree that evolutionary computation seems to get applied to everything in the world, when it probably should only be applied to a few specific cases. In my case, I think EC takes some of the greedy out of tree induction. Just to test the randomness of it all, I set up a very long run of purely random tree induction with no EC steps. The purpose was to see if I could just generate trees at random and eventually find one that was as good as those that were evolved. Even after letting it run for days (100s of millions of random trees evaluated), I never found a better one.
As a general modeling procedure, this one seems to work very well, and I have yet to find anything that consistently outperforms it. That being said, it is probably very dependent on my domain of research. Computational chemistry tends to require models of very non-linear, multiple mechanism-based, strongly interdependent systems with features that never occur in "nice" distributions. ugh. Why do I do this stuff??
-Kirk
I agree that evolutionary computation seems to get applied to everything in the world, when it probably should only be applied to a few specific cases. In my case, I think EC takes some of the greedy out of tree induction. Just to test the randomness of it all, I set up a very long run of purely random tree induction with no EC steps. The purpose was to see if I could just generate trees at random and eventually find one that was as good as those that were evolved. Even after letting it run for days (100s of millions of random trees evaluated), I never found a better one.
As a general modeling procedure, this one seems to work very well, and I have yet to find anything that consistently outperforms it. That being said, it is probably very dependent on my domain of research. Computational chemistry tends to require models of very non-linear, multiple mechanism-based, strongly interdependent systems with features that never occur in "nice" distributions. ugh. Why do I do this stuff??
-Kirk
Quote: Original post by kirkd
ugh. Why do I do this stuff??
-Kirk
'Cause the chicks dig it.
Quote: Original post by kirkd
For my evolutionary programming method, each tree is given a Minimum Description Length score.
Why, particularly, did you use MDL as opposed to MML?
As for the chicks digging this stuff... who? Where? How do I meet them? My wife just frowns and walks away whenever she stops to peer over my shoulder at what I'm working on. She doesn't even bother to ask 'what does that all mean' any more! 8(
Cheers,
Timkin
This topic is closed to new replies.
Advertisement
Popular Topics
Advertisement