Task 5 was to classify a dataset with both classifiers again (Naive Bayes and KNN), once on the normal version of the data and once on the PCA version.
I used my usual dataset of wage data.
This is the result for using Naive Bayes on the set:
This is the result for using KNN on it:
This is the result for using Naive Bayes on the first three columns of the PCA version of the set, which was computed from all of the headers:
This is the result for using KNN on the same first three PCA columns:
To make my training set, I selected a subset of the original data points and assigned each one a class based on the decade it was in (1940s = 0, 1950s = 1, etc.), then did the same for the test set. The test set's classes are used in the confusion matrix to show how effective each classifier was. I built the PCA training and test data in about the same way, except that I wrote the files out through the UI's kmeans file-writing method and then edited them to remove the kmeans information. I've included the four files with what are hopefully helpful names.
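The labeling and scoring steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `decade_class` and `confusion_matrix`, the year values, and the predicted classes are all made up for the example, and it assumes each data point carries a year from which the decade can be read.

```python
def decade_class(year, base_decade=1940):
    """Map a year to a class label: 1940s -> 0, 1950s -> 1, and so on."""
    return (year - base_decade) // 10


def confusion_matrix(true_classes, predicted_classes, num_classes):
    """Count how often each true class (row) received each predicted class (column)."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for truth, guess in zip(true_classes, predicted_classes):
        matrix[truth][guess] += 1
    return matrix


# Hypothetical years for a handful of test data points.
test_years = [1943, 1948, 1951, 1957, 1962]
true_classes = [decade_class(year) for year in test_years]

# Hypothetical classifier output for the same data points.
predicted_classes = [0, 1, 1, 1, 2]

matrix = confusion_matrix(true_classes, predicted_classes, num_classes=3)
for row in matrix:
    print(row)
```

Correct classifications land on the diagonal of the matrix, so the more of the counts that sit on the diagonal, the better the classifier did on the test set.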
In this project, I learned how the Naive Bayes and KNN algorithms work. There were a lot of algorithms to fill in, and even though they were outlined, it took a lot of testing and debugging to get them to work. As usual, dealing with errors from indexing into matrices was the biggest challenge. I also took the chance to fix my get_numeric_headers method in this project, because the test files assume that the headers come back in the order they had in the file. Up until now, I had used a dictionary to retrieve them, but I realized I could loop through the raw headers list and return only those with numeric data. This also helped to fix the problem I was having with doing kmeans on PCA data.
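A rough sketch of the get_numeric_headers fix, under some assumptions: the `Data` class here is a stripped-down stand-in, and the `raw_headers` and `raw_types` fields and the `"numeric"` type string are guesses at how the headers and their types might be stored, not the project's real layout.

```python
class Data:
    """Hypothetical, minimal stand-in for the project's data class."""

    def __init__(self, raw_headers, raw_types):
        self.raw_headers = raw_headers  # header names, in file order
        self.raw_types = raw_types      # one type string per header (assumed format)

    def get_numeric_headers(self):
        """Return the numeric headers in the order they appear in the file."""
        numeric = []
        # Walk the raw header list directly so that file order is preserved.
        for header, kind in zip(self.raw_headers, self.raw_types):
            if kind == "numeric":
                numeric.append(header)
        return numeric


# Made-up headers for illustration only.
data = Data(
    raw_headers=["year", "name", "wage", "region", "hours"],
    raw_types=["numeric", "string", "numeric", "string", "numeric"],
)
numeric_headers = data.get_numeric_headers()
print(numeric_headers)
```

Because the loop follows the raw header list, the result keeps the same order as the file, which is what the test files expect.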