For this project, we learned how to build two classifiers, which are forms of machine learning: a Naive Bayes classifier and a KNN (K Nearest Neighbors) classifier. Both are trained using data with given classes. I wrote one file to test both types, and the user specifies which one to run at the command line. With my implementation, KNN takes a lot longer than Naive Bayes, but KNN is also a lot more accurate. I also tested both classifiers on training and testing datasets I created.
Task 1 was to write a method that builds a confusion matrix. The method takes both the correct categories and the ones the classifier assigned. A for loop goes through the points, compares the two classes at each index, and adds 1 to the matrix cell whose row is the correct class and whose column is the classifier's class. For example, if a point is in the first class and the classifier put it in the third class, it is counted in confusionMatrix[0,2].
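The counting loop described above can be sketched as follows. This is a minimal illustration, not my actual method: the function name and arguments are made up for the example, it assumes classes are integers starting at 0, and it uses nested lists, so the cell is written as matrix[0][2] instead of confusionMatrix[0,2].

```python
def confusion_matrix(true_classes, predicted_classes, num_classes):
    """Count points by (true class, classifier class).

    Rows are the correct classes, columns are the classifier's classes.
    Hypothetical helper for illustration only.
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(true_classes, predicted_classes):
        # One point adds 1 to the cell for its pair of classes.
        matrix[actual][predicted] += 1
    return matrix


# Made-up labels: the first point is truly class 0 but was classified as 2.
truth = [0, 0, 1, 2, 2]
guess = [2, 0, 1, 2, 1]
cm = confusion_matrix(truth, guess, 3)
for row in cm:
    print(row)
```

Correct classifications land on the diagonal, so a good classifier produces a matrix with most of its counts there.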
Task 2 was to create the file that tests the classifiers, which I called classifyTest.py. It takes as input which training and test data files to use, what to call the output file (which is the test file plus a column for the classes the classifier assigned), which classifier to use, and an optional pair of separate header files.
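One way to take those inputs at the command line is with argparse. The sketch below is only an assumption about how such an interface could look: the argument names, the "nb"/"knn" choices, and the file names in the example are all hypothetical and are not the real options of classifyTest.py.

```python
import argparse


def build_parser():
    """Hypothetical command-line interface, for illustration only."""
    parser = argparse.ArgumentParser(
        description="Train a classifier and test it on a second file."
    )
    parser.add_argument("train_file", help="training data")
    parser.add_argument("test_file", help="test data")
    parser.add_argument("output_file", help="test data plus a class column")
    parser.add_argument(
        "--classifier", choices=["nb", "knn"], default="nb",
        help="which classifier to run",
    )
    parser.add_argument(
        "--headers", nargs=2, metavar=("TRAIN_HEADERS", "TEST_HEADERS"),
        default=None, help="optional pair of separate header files",
    )
    return parser


# Parse an example argument list instead of sys.argv so the sketch is
# self-contained; the file names are made up.
args = build_parser().parse_args(
    ["train.csv", "test.csv", "out.csv", "--classifier", "knn"]
)
print(args.classifier, args.headers)
```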
Task 3 was to run the code on the activity set. The following pictures are my results:
Using Naive Bayes on the activity set:
Using KNN on the activity dataset:
Using Naive Bayes on the activity set with PCA:
Using KNN on the activity set with PCA:
Using the PCA data made the Naive Bayes classification more accurate, while it made KNN slightly worse. KNN seems to be more accurate for the given data, while Naive Bayes is better for generalizing patterns, since it uses a general idea of a distribution, whereas KNN may overfit while calculating the best class each iteration. KNN is also a lot more expensive because it loops over every point and every class mean. In short, KNN is more accurate at the expense of speed, while Naive Bayes is better for making estimates that hold for datasets of different sizes.
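The speed difference is easier to see in code. The sketch below is a textbook-style version of each classifier, not my project code: it assumes a Gaussian Naive Bayes and a plain KNN that votes among the k closest training points, and every name and data value in it is made up for illustration. Naive Bayes reduces each class to a few summary numbers once, while KNN measures distances again for every point it classifies.

```python
import math
from collections import Counter


def nb_fit(points, labels):
    """Summarize each class by per-feature mean and variance plus a prior."""
    model = {}
    for label in set(labels):
        rows = [p for p, l in zip(points, labels) if l == label]
        columns = list(zip(*rows))
        means = [sum(col) / len(rows) for col in columns]
        # Small constant keeps the variance from being exactly zero.
        variances = [
            sum((x - m) ** 2 for x in col) / len(rows) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (means, variances, len(rows) / len(points))
    return model


def nb_predict(model, point):
    """Pick the class with the highest log prior plus log likelihood."""
    def log_posterior(label):
        means, variances, prior = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(point, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=log_posterior)


def knn_predict(points, labels, point, k=3):
    """Vote among the k nearest training points (scans all of them)."""
    nearest = sorted(
        (math.dist(p, point), label) for p, label in zip(points, labels)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated made-up classes.
train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
classes = [0, 0, 0, 1, 1, 1]
model = nb_fit(train, classes)
print(nb_predict(model, (1.1, 1.0)), knn_predict(train, classes, (5.1, 5.0)))
```

In this sketch nb_predict does work proportional to the number of classes, while knn_predict does work proportional to the number of training points for every test point, which matches the slowdown I saw.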
For Task 4, we had to plot the classes a classifier assigned as the colors of a plot of the first three PCA headers.
PCA plot picture (using the output of doing KNN on the activity set with PCA data):
Task 5 was to repeat the analysis on a dataset of our choice using both classifiers, on the normal and the PCA versions.
I used my usual dataset of wage data.
This is the result for using Naive Bayes on the set:
This is the result for using KNN on it:
This is the result for using Naive Bayes on the first three PCA columns of the PCA version of the set made from all of the headers:
This is the result for using KNN on the first three PCA columns of the PCA version of the set made from all of the headers:
To make my training set, I selected some of the original data points and assigned them classes based on what decade they were in (1940s = 0, 1950s = 1, etc.), then did the same for the test set's classes, which are used in the confusion matrix to show effectiveness. I did about the same thing for the PCA training and test data, but I made them through the UI's kmeans file-writing method, then adjusted them to remove the kmeans information. I've included the four files with what are hopefully helpful names.
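The decade labeling can be written as a one-line rule. This is a small sketch that assumes each point has a year value available; the function name and the example years are hypothetical.

```python
def decade_class(year, first_decade=1940):
    """Map a year to a class number: 1940s = 0, 1950s = 1, and so on."""
    return (year - first_decade) // 10


# Made-up years for illustration.
years = [1947, 1950, 1969, 1983]
labels = [decade_class(y) for y in years]
print(labels)
```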
In this project, I learned about how the Naive Bayes and KNN algorithms work. There were a lot of algorithms to fill in although they were outlined, and it took a lot of testing and debugging to get them to work. As usual, dealing with errors from indexing into matrices was the biggest challenge. I also took the chance to fix my get_numeric_headers method in this project, because the test files assume that the headers are in the order they were in the file. Up until now, I had used a dictionary to retrieve them, but I realized I could loop through the raw headers list and return only those with numeric data. This also helped to fix the problem I was having with doing kmeans on PCA data.
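The get_numeric_headers fix amounts to something like the sketch below. It assumes the raw headers and their types are stored as two parallel lists in file order; the argument names and the sample columns are made up, so this is an illustration of the idea and not the exact method from my Data class.

```python
def get_numeric_headers(raw_headers, raw_types):
    """Return the numeric headers in the order they appear in the file.

    Hypothetical signature: raw_headers and raw_types are parallel lists.
    """
    return [
        header
        for header, kind in zip(raw_headers, raw_types)
        if kind.strip().lower() == "numeric"
    ]


# Made-up columns for illustration.
headers = ["date", "wage", "name", "hours"]
types = ["date", "numeric", "string", "numeric"]
numeric = get_numeric_headers(headers, types)
print(numeric)
```

Looping over the list keeps the file order automatically, which is what the test files expect.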
Special thanks: Melody, Stephanie