Task 5 was to run both classifiers (Naive Bayes and KNN) on a dataset again, once on the normal version and once on the PCA version.

WIP

I used my usual dataset of wage data.

This is the result for using Naive Bayes on the set:

[Image: Naive Bayes result on the original data]

This is the result for using KNN on it:

[Image: KNN result on the original data]

This is the result for using Naive Bayes on the first three columns of the PCA version of the set, which was built from all of the headers:

[Image: Naive Bayes result on the first three PCA columns]

This is the result for using KNN on the first three columns of the PCA version of the set, which was built from all of the headers:

[Image: KNN result on the first three PCA columns]

To make my training set, I selected some of the original data points and assigned them classes based on what decade they were in (1940s = 0, 1950s = 1, etc.). I then did the same for the test set's classes, which are used in the confusion matrix to show effectiveness. I did about the same thing for the PCA training and test data, but I made those files through the UI's kmeans file-writing method and then edited them to remove the kmeans information. I've included the four files with what are hopefully helpful names.
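
As a rough illustration of the labeling and scoring described above, here is a minimal sketch in plain Python. It assumes each data point carries a year value; the function names (decade_class, confusion_matrix) and the sample years and predictions are made up for the example and are not taken from the project code.

```python
def decade_class(year, first_decade=1940):
    """Map a year to a class index: 1940s -> 0, 1950s -> 1, and so on.

    Assumes no year is earlier than first_decade.
    """
    return (int(year) - first_decade) // 10


def confusion_matrix(true_classes, predicted_classes, num_classes):
    """Count (true, predicted) pairs. Rows are true classes, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_c, pred_c in zip(true_classes, predicted_classes):
        matrix[true_c][pred_c] += 1
    return matrix


# Hypothetical test-set years and hypothetical classifier output.
years = [1943, 1951, 1958, 1967]
true_classes = [decade_class(y) for y in years]
predicted_classes = [0, 1, 2, 2]

cm = confusion_matrix(true_classes, predicted_classes, num_classes=3)
for row in cm:
    print(row)
```

Counts on the diagonal are correct classifications; anything off the diagonal is a point whose predicted decade differs from its true decade.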

In this project, I learned how the Naive Bayes and KNN algorithms work. There was a lot of algorithm code to fill in, even though it was outlined, and it took a lot of testing and debugging to get it to work. As usual, dealing with errors from indexing into matrices was the biggest challenge. I also took the chance to fix my get_numeric_headers method in this project, because the test files assume that the headers are returned in the order they appear in the file. Up until now, I had used a dictionary to retrieve them, but I realized I could loop through the raw headers list and return only those with numeric data. This also helped to fix the problem I was having with doing kmeans on PCA data.
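
The get_numeric_headers fix can be sketched roughly like this. It is only an illustration, written as a standalone function: the parameter names raw_headers and raw_types, and the "numeric" type label, are assumptions about how the data is stored, not the actual class fields.

```python
def get_numeric_headers(raw_headers, raw_types):
    """Return the headers of numeric columns, in the order they appear in the file.

    raw_headers and raw_types are assumed to be parallel lists read from the
    file, with raw_types holding a label such as "numeric" for each column.
    """
    numeric = []
    for header, col_type in zip(raw_headers, raw_types):
        if col_type.strip().lower() == "numeric":
            numeric.append(header)
    return numeric


# Hypothetical column layout for a wage dataset.
headers = ["year", "region", "wage"]
types = ["numeric", "string", "numeric"]
numeric_headers = get_numeric_headers(headers, types)
print(numeric_headers)
```

Because the loop walks the raw headers list directly, the result follows file order no matter how any lookup dictionary happens to be ordered.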
