Decision Trees

One-rule (1-R) trees consider only one independent variable to predict a dependent variable. Stars are classified into stages based on three variables: LogMDOT, LogMDISK, and LogMASSC. Experts use ratios of these values, but we were interested in how well the computer could approximate that relation using 1-R stumps. The data contain three stages of stars, so the algorithm finds the pair of thresholds on which to divide the data into three parts. The possible break points are values at which the point directly below belongs to a different class than the point directly above. To find these points, I created a list of (independent value, class) pairs and sorted it by the independent value. Any value with the same class or the same value as the one directly before it is removed from the list of candidate split points. If the independent variable can estimate the class well, this process greatly reduces the number of candidate split points, since most instances of a particular class will neighbor one another.
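A minimal sketch of this candidate-filtering step follows, assuming a Python implementation; the function and variable names are my own rather than the original program's, and splitting midway between neighboring values is one common convention (the original may split at a boundary value itself).

```python
def candidate_splits(values, classes):
    """Find candidate split points for a 1-R stump.

    Sort (value, class) pairs by value, then keep only the boundaries
    where the point directly below has a different class than the
    point directly above; ties in value or class are skipped.
    """
    pairs = sorted(zip(values, classes))
    splits = []
    for (v_lo, c_lo), (v_hi, c_hi) in zip(pairs, pairs[1:]):
        if c_lo != c_hi and v_lo != v_hi:     # class boundary with distinct values
            splits.append((v_lo + v_hi) / 2)  # split midway between the neighbors
    return splits
```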
Once these points are determined, a double-nested for-loop evaluates the information gained by splitting on each possible pair. Information gain is defined as the entropy before the split minus the sum of the entropies of the branches, each weighted by the proportion of instances that went down that branch.
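The pair search and the gain computation might look like the sketch below. It uses the standard base-2, before-minus-after definition of information gain (the gains reported below are negative, so the original program's sign or log-base convention may differ), and all names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values()) if n else 0.0

def best_split_pair(values, classes, splits):
    """Evaluate every pair of candidate splits with a double-nested loop
    and return the pair with the highest information gain."""
    n = len(classes)
    before = entropy(classes)
    best_gain, best_pair = float("-inf"), None
    for i, lo in enumerate(splits):
        for hi in splits[i + 1:]:
            # Route each instance into one of the three branches.
            branches = [[], [], []]
            for v, c in zip(values, classes):
                branches[0 if v < lo else (1 if v < hi else 2)].append(c)
            # Weight each branch's entropy by the fraction of instances in it.
            after = sum(len(b) / n * entropy(b) for b in branches)
            gain = before - after
            if gain > best_gain:
                best_gain, best_pair = gain, (lo, hi)
    return best_pair, best_gain
```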
For one dataset, we found the following information gains and error rates (the class distribution over the entire set is 19 stage-1, 1284 stage-2, and 1540 stage-3 instances). In each confusion matrix below, columns correspond to stages 1, 2, and 3 and rows to the three branches; each row lists the class proportions within that branch, with raw counts in parentheses.
LogMDOT:  x < -8.5379             ==> stage 3
          -8.5379 <= x < -6.2443  ==> stage 2
          x >= -6.2443            ==> stage 1
Info gain: -.6411
Error rate: .4295
Confusion matrix:
[0,     .4422, .5578]  (0, 1221, 1540)
[0,     1,     0    ]  (0, 63, 0)
[1,     0,     0    ]  (19, 0, 0)

LogMDISK: x < -6.4132             ==> stage 3
          -6.4132 <= x < -6.0130  ==> stage 2
          x >= -6.0130            ==> stage 3
Info gain: -.6802
Error rate: .0341
Confusion matrix:
[0,     0,     1    ]  (0, 0, 1447)
[0,     .4545, .5454]  (0, 75, 90)
[.0154, .9821, .0024]  (19, 1209, 3)

LogMASSC: x < -.6326             ==> stage 2
          -.6326 <= x < -.1011   ==> stage 3
          x >= -.1011            ==> stage 2
Info gain: -.6965
Error rate: .4077
Confusion matrix:
[.1097, .8065, .0839]  (17, 125, 13)
[.0006, .3794, .6200]  (1, 631, 1031)
[.0010, .5151, .4839]  (1, 528, 496)

After each stump was built, the program had all of the stumps built so far vote on a class for each data point. The voting simply added together the confusion vectors for the branch the point took in each tree and predicted the class with the largest total.
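A sketch of this voting rule, under the same assumptions as the earlier snippets (representing each stump as a function that maps a data point to the confusion vector of the branch the point falls into is my own choice, not necessarily the original program's):

```python
def vote(stumps, x):
    """Predict a stage by summing confusion vectors across stumps.

    Each stump maps a data point to the confusion vector (the
    class-proportion row) of the branch that point takes; the
    ensemble predicts the class with the largest summed proportion.
    """
    totals = [0.0, 0.0, 0.0]          # one accumulator per stage
    for stump in stumps:
        for k, p in enumerate(stump(x)):
            totals[k] += p
    return max(range(3), key=lambda k: totals[k]) + 1  # stages are 1..3
```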
With LogMDOT and LogMDISK voting together, the classification error rate was .02743. When LogMASSC was added, the error rate rose to .03412.