Using the PCA data made the Naive Bayes classification more accurate, while it made the KNN slightly worse. KNN seems to be more accurate for the given data, while Naive Bayes is better at generalizing patterns, since it works from a general idea of a distribution, whereas KNN may overfit because it recalculates the best class for each point. KNN is also a lot more expensive, since it loops over every training point for every prediction. In short, KNN is more accurate at the expense of speed, while Naive Bayes is better for making estimates that hold across datasets of different sizes.
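The accuracy trade-off described above can be sketched with a minimal scikit-learn comparison. This is not the project's own code; the synthetic dataset and parameter choices here are illustrative assumptions.

```python
# Sketch: compare KNN and Gaussian Naive Bayes on a synthetic dataset.
# (Illustrative only -- the project used its own implementations.)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KNN stores the whole training set and scans it at prediction time,
# which is why it is slower; NB only keeps per-class summary statistics.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
nb = GaussianNB().fit(X_train, y_train)

print("KNN accuracy:", knn.score(X_test, y_test))
print("NB accuracy:", nb.score(X_test, y_test))
```

On a given dataset either classifier can come out ahead; the point is that NB's fitted distribution is cheap to evaluate, while KNN's per-query scan grows with the training-set size.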
For Task 4, we had to plot the clusters a classifier produced as the colors of a scatter plot of the first three PCA components.
[PCA plot: first three PCA components of the activity dataset, colored by the output of running KNN on the PCA data]
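A plot like the one above can be produced with a 3D matplotlib scatter, coloring each point by its predicted class. This is a hedged sketch, not the project code: the random `pca_data` and `labels` arrays stand in for the real PCA projection and the KNN output.

```python
# Sketch: color a 3D scatter of the first three PCA components by class label.
# pca_data and labels are placeholder stand-ins for the project's real data.
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
pca_data = rng.normal(size=(100, 3))       # stand-in for projected data
labels = rng.integers(0, 3, size=100)      # stand-in for KNN class output

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(pca_data[:, 0], pca_data[:, 1], pca_data[:, 2],
           c=labels, cmap="viridis")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")

buf = io.BytesIO()
fig.savefig(buf, format="png")
```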
Task 5 was to run both analyses again on a new dataset, once on the normal data and once on the PCA version.
In this project, I learned how the Naive Bayes and KNN algorithms work. Although the algorithms were outlined, there were a lot of them to fill in, and it took a lot of testing and debugging to get them to work. As usual, dealing with errors from indexing into matrices was the biggest challenge. I also took the chance to fix my get_numeric_headers method in this project, because the test files assume that the headers are in the order they appeared in the file. Up until now, I had retrieved them from a dictionary, but I realized I could loop through the raw headers list and return only those with numeric data. This also fixed the problem I was having with running kmeans on PCA data.
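The get_numeric_headers fix described above can be sketched as follows. This is a hypothetical reconstruction: the `raw_headers` list and `header2type` mapping are assumed stand-ins for however the real Data class stores its header metadata.

```python
# Sketch of the fixed get_numeric_headers: instead of pulling headers out of
# a dictionary (which does not preserve file order reliably), walk the raw
# header list in file order and keep only the numeric columns.
def get_numeric_headers(raw_headers, header2type):
    """Return numeric headers in the order they appeared in the file."""
    return [h for h in raw_headers if header2type.get(h) == "numeric"]

# Hypothetical usage:
headers = ["name", "age", "height", "city"]
types = {"name": "string", "age": "numeric",
         "height": "numeric", "city": "string"}
print(get_numeric_headers(headers, types))  # ['age', 'height']
```

Iterating over the original list is what guarantees the file order the test files expect, which a dictionary lookup alone cannot.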
Special thanks: Melody, Stephanie