Decision trees are a method of classifying data. Computers can develop these classifiers by exhaustively searching for the point at which to divide a certain variable and comparing the resulting information gain. For this project, we implemented a function to create 1-rule decision trees, and also used WEKA to generate more complicated trees.

One-rule trees consider only one independent variable in order to predict a dependent variable. Stars are classified into stages based on three variables: LogMDOT, LogMDISK, and LogMASSC. Experts use ratios of these values, but we were interested to see how well the computer could approximate this relation using 1-R stumps. There are three stages of stars represented in the data, so the algorithm finds the pair of numbers on which to divide the data into three parts. The possible break points are values for which the point directly below is of a different class from the point directly above. To find these points, I created a list of pairs of the independent value and the class, and sorted it by the independent value. Any value with the same class or value as the one directly before it is removed from the list of candidate split points. If the independent variable can correctly estimate the class, then this process significantly lowers the number of possible split points, since most instances of a particular class will neighbor each other.
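The candidate-finding step above can be sketched as follows. This is an illustrative reconstruction, not the project's actual code; the function name and the choice to keep the upper value at each boundary are assumptions.

```python
# Candidate split points for a 1-R stump: sort (value, class) pairs by the
# independent value, then keep only the values whose class AND value differ
# from the point directly before them, as described in the text.

def candidate_splits(values, classes):
    """Return values where the sorted data changes class (ties removed)."""
    pairs = sorted(zip(values, classes))
    splits = []
    for (v_prev, c_prev), (v, c) in zip(pairs, pairs[1:]):
        # Same class or same value as the previous point: no boundary here.
        if c != c_prev and v != v_prev:
            splits.append(v)
    return splits

print(candidate_splits([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))  # [3.0]
```

If the variable separates the classes well, most neighbors share a class, so very few boundaries survive this filter.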

Once these points are determined, a double-nested for-loop evaluates the information gained by splitting on each possible pair. Information gain is defined as the difference between the entropy before the split and the sum of the entropies of each branch, weighted by the proportion of values that went down that branch.
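The pair search and gain computation can be sketched like this; the function names are illustrative, but the gain formula follows the definition above.

```python
# Score a three-way split (v < t1, t1 <= v < t2, v >= t2) by information
# gain, then search every pair of candidate thresholds with a double loop.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, classes, t1, t2):
    branches = [[], [], []]
    for v, c in zip(values, classes):
        branches[0 if v < t1 else 1 if v < t2 else 2].append(c)
    n = len(classes)
    # Entropy before the split minus the branch entropies, each weighted
    # by the proportion of points that went down that branch.
    weighted = sum(len(b) / n * entropy(b) for b in branches if b)
    return entropy(classes) - weighted

def best_pair(values, classes, splits):
    best = None
    for i, t1 in enumerate(splits):      # double-nested loop over pairs
        for t2 in splits[i + 1:]:
            gain = info_gain(values, classes, t1, t2)
            if best is None or gain > best[0]:
                best = (gain, t1, t2)
    return best
```

A perfectly separating pair recovers the full entropy of the class distribution as gain.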

For one dataset, we found the following information gains and error rates (the spread of classes for the entire set is 19, 1284, 1540):

LogMDOT:  x < -8.5379   ==> stage 3
          x < -6.2443   ==> stage 2
          x >= -6.2443  ==> stage 1

  Info gain:  -.6411
  Error rate: .4295

  Confusion matrix (columns: stage 1, 2, 3):
  [0,  .4422, .5578]   (0, 1221, 1540)
  [0,  1,     0    ]   (0, 63, 0)
  [1,  0,     0    ]   (19, 0, 0)

LogMDISK: x < -6.4132   ==> stage 3
          x < -6.0130   ==> stage 2
          x >= -6.0130  ==> stage 3

  Info gain:  -.6802
  Error rate: .0341

  Confusion matrix (columns: stage 1, 2, 3):
  [0,     0,     1    ]   (0, 0, 1447)
  [0,     .4545, .5454]   (0, 75, 90)
  [.0154, .9821, .0024]   (19, 1209, 3)

LogMASSC: x < -.6326   ==> stage 2
          x < -.1011   ==> stage 3
          x >= -.1011  ==> stage 2

  Info gain:  -.6965
  Error rate: .4077

  Confusion matrix (columns: stage 1, 2, 3):
  [.1097, .8065, .0839]   (17, 125, 13)
  [.0006, .3794, .6200]   (1, 528, 496)
  [.0010, .5151, .4839]   (1, 631, 1031)

After each stump was built, the program had all of the previous stumps vote on a class for each data point. The voting simply added together the confusion vectors for the branch each point took in each tree.
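This voting scheme can be sketched as follows. The stump representation here (a variable index, a threshold pair, and a confusion vector per branch) is an assumption made for illustration, not our actual data structure.

```python
# Ensemble voting over 1-R stumps: each stump routes the point down one
# branch, and the class-proportion (confusion) vectors of those branches
# are summed; the class with the largest total wins.

def vote(stumps, point):
    """stumps: list of (var_index, (t1, t2), branch_confusion_vectors)."""
    totals = [0.0, 0.0, 0.0]
    for var, (t1, t2), vectors in stumps:
        v = point[var]
        branch = 0 if v < t1 else 1 if v < t2 else 2
        totals = [a + b for a, b in zip(totals, vectors[branch])]
    # Predict the class index with the largest summed vote.
    return max(range(3), key=lambda k: totals[k])
```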

After LogMDOT and LogMDISK, there was a classification error rate of .02743.

When LogMASSC was added, the error rate rose to .03412.

Looking at the error rates, it is clear that MDISK performs the best of these three. The combination of MDOT and MDISK, however, is better than either on its own. Adding MASSC does not help, which makes some sense because the expert classification does not use it for stage 1, so its presence likely adversely affects those stars.

We also wanted to classify sources using a more sophisticated decision tree. The actual data obtained from observations is the radiant flux at a particular wavelength. By comparing the ratios of fluxes at different wavelengths, sources can be classified into their different stages of evolution: stage I, II, and III. When analyzing the data, fluxes were converted into magnitudes. Magnitudes are based on the log of the flux, so in order to compare ratios of fluxes we actually compared differences of magnitudes.
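The equivalence between flux ratios and magnitude differences follows directly from the magnitude definition m = -2.5 log10(F / F0): the zero-point flux F0 cancels, so m1 - m2 = -2.5 log10(F1 / F2). A small check (flux values are arbitrary examples):

```python
# Magnitude differences encode flux ratios: the zero point cancels out.
from math import log10, isclose

def magnitude(flux, zero_point_flux=1.0):
    return -2.5 * log10(flux / zero_point_flux)

f1, f2 = 3.0, 1.5
color = magnitude(f1) - magnitude(f2)
# Depends only on the ratio f1/f2, not on the individual fluxes.
assert isclose(color, -2.5 * log10(f1 / f2))
```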

The training set that we used was from a library of SED models. These models simulate young stars and return a spectral energy distribution (SED; a plot of intensity vs. wavelength). The SED can be used to determine the magnitudes at specific wavelengths. The input to these models is a list of physical parameters of the source being modelled. Because we already know those values, we also know what stage of evolution the source is at. To create the training set, we found the magnitude of each modelled source at the four wavelengths at which we have observations of real data. We then took all the combinations of differences between these wavelengths. The size of the training set is around 50,000. The resulting decision tree is very complicated and too lengthy to follow all the way through.
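Taking all pairwise differences of four magnitudes yields C(4, 2) = 6 features per source. A minimal sketch (the band names are placeholders, not the wavelengths actually used):

```python
# Build colour features from magnitudes at four bands: every pair of
# bands contributes one magnitude difference.
from itertools import combinations

mags = {"band1": 12.1, "band2": 11.4, "band3": 10.9, "band4": 10.2}
features = {f"{a}-{b}": mags[a] - mags[b] for a, b in combinations(mags, 2)}
print(len(features))  # 6 pairwise magnitude differences
```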

We used two different test sets to evaluate how well the classification worked. The first test set was simply a different subset of the models like the ones we used for training. The results from this were quite good: the decision tree was able to classify almost 80% correctly. The second test set was based on real data which had previously been classified by a different method. The results here were not as good, at around 50% classified correctly. However, this may reflect the quality of the previous classifications more than the quality of the decision tree. Additionally, there are a few subtleties involved, such as interstellar extinction, which will likely change the results slightly. Regardless, a 50% overlap between these classification techniques is still exciting, and much better than what was found in previous comparisons.
