- Image 1 shows the full picture of my GUI.
I created three new buttons to analyze my dataset: 'Classify_PCA', 'Create Dow-Jones Labels', and 'Split Dow-Jones'. They are located in the left control frame under the title 'Final Project'. The user must first click 'Create Dow-Jones Labels' to write a csv file with category labels for my dataset. This needs to be the first step because my dataset, dow_jones.csv, does not contain category labels. I therefore use the dependent variable in the dataset to create labels, which are then used for classification, i.e. supervised learning. When the user clicks the button, the dialog in image 2 pops up; the user should simply select the file dow_jones.csv (this is the cleaned version of dow_jones_index.csv). When labeling is done, the user should see a file named dow_jones_5_Y.csv in the current directory, i.e. where display.py is located. The number 5 indicates the number of unique labels.
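As an illustration of the labeling step, here is a minimal sketch that turns a numeric dependent variable into five category labels using equal-frequency bins. The binning rule, the function name `make_labels`, and the sample values are assumptions made for illustration; the text does not specify how display.py actually derives its labels.

```python
def make_labels(values, n_labels=5):
    """Assign each value an integer label in 0..n_labels-1 by rank.

    Assumption: equal-frequency binning. The actual rule used by the
    GUI is not described in the text.
    """
    # Rank the values from smallest to largest, remembering positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        # Map the rank proportionally onto the label range.
        labels[idx] = rank * n_labels // len(values)
    return labels


# Hypothetical dependent-variable values, not taken from dow_jones.csv.
y = [-3.2, 0.4, 1.1, -0.7, 2.5, 0.0, -1.9, 3.8, 0.9, -0.2]
labels = make_labels(y)
```

With ten values and five labels, each label is assigned to exactly two rows, so the resulting label file would contain five unique categories, matching the 5 in dow_jones_5_Y.csv.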
- After creating labels for the dataset, the user needs to click the button 'Split Dow-Jones', which involves selecting two files in a row: 1) the data set (dow_jones.csv) and 2) the file with category labels (dow_jones_5_Y.csv). The button splits these files into four new files: dow_jones_train.csv (training set), dow_jones_test.csv (test set), dow_jones_5_train_Y.csv (training category labels), and dow_jones_5_test_Y.csv (test category labels).
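The key requirement of the split is that data rows and label rows are divided with the same indices, so each row keeps its label. A minimal sketch follows, assuming a random shuffle and a 30% test fraction; the function name `split_rows`, the ratio, and the seed are hypothetical, since the text does not state how the GUI chooses the rows.

```python
import random


def split_rows(data_rows, label_rows, test_fraction=0.3, seed=0):
    """Split data and labels into train/test parts using shared indices."""
    if len(data_rows) != len(label_rows):
        raise ValueError("data and labels must have the same number of rows")

    # Shuffle the row indices once so data and labels stay aligned.
    indices = list(range(len(data_rows)))
    random.Random(seed).shuffle(indices)

    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return (pick(data_rows, train_idx), pick(data_rows, test_idx),
            pick(label_rows, train_idx), pick(label_rows, test_idx))


# Toy rows standing in for the contents of the data and label files.
data = [[float(i), float(i) * 2] for i in range(10)]
labels = [i % 5 for i in range(10)]
train, test, train_y, test_y = split_rows(data, labels)
```

The four returned lists correspond to the four output files described above: training set, test set, training labels, and test labels.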
- With the train/test sets and category labels, the user can now conduct classification analysis. To do so, the user needs to click the button 'Perform Classify', and a dialog window (image 3) will pop up for selecting the files for the train/test sets, the category labels, the classification method, and the necessary parameters for KNN if it is selected. Make sure to select the correct file for each listbox in order to perform the analysis successfully. Once the user makes these selections, another dialog window (image 4) will pop up so that the user can choose which headers to use for the classification analysis. If the user does not select any header, every header is used when separate category-label files were provided; otherwise, every header but the last one is used. This is where the user needs to be very careful: for the dow_jones dataset, the user must select every header but the last one. The reason is that the last column is the dependent variable I am trying to predict, and the category labels were created from it. If this variable is included in the classification analysis, the results will be biased and misleading.
(Image 3) (Image 4)
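The header fallback described above can be sketched as a small function. The name `resolve_headers` and the column names are hypothetical and are not taken from display.py or from the dataset.

```python
def resolve_headers(all_headers, selected, has_label_file):
    """Return the headers to use for classification.

    - An explicit selection is used as-is.
    - With no selection and a separate label file, all headers are used.
    - With no selection and no label file, the last column is treated
      as the labels and dropped from the features.
    """
    if selected:
        return list(selected)
    if has_label_file:
        return list(all_headers)
    return list(all_headers[:-1])


# Hypothetical column names; "target" stands in for the dependent variable.
headers = ["open", "close", "volume", "target"]
default_with_labels = resolve_headers(headers, [], has_label_file=True)
explicit = resolve_headers(headers, headers[:-1], has_label_file=True)
```

Here `default_with_labels` still contains the last column, which is why an explicit selection like `explicit` is needed for the dow_jones dataset when separate label files are supplied.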
- When all selections are made correctly and the analysis executes successfully, the user should see a new entry in the second listbox of the left control frame, as in image 5. Once the analysis is loaded, the user can select the entry and either click the button 'Classify Results' to see the confusion matrices and accuracy values (image 6) or click the button 'Project Classify' to project the newly generated analysis file into my GUI. To project the data, the user needs to select the analysis file first (such as dow_jones_test_Naive Bayes_5_analysis.csv) and then choose headers from a dialog window. Note that the name of the analysis file indicates the classification method and the number of unique category labels. After the user performs classification analysis, another new file is generated, such as dow_jones_test_Naive Bayes_5_results.csv (image 7), which contains the confusion matrices and accuracy values for the training and test datasets. I generate this results file so that the user won't have to run my GUI again just to see the results of the analysis. I create separate analysis and results files because the former is used for projection.
(Image 5) (Image 6) (Image 7)
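For readers unfamiliar with the two quantities reported in the results file, the following is a minimal sketch of how a confusion matrix and an accuracy value can be computed from true and predicted labels. It assumes integer labels starting at 0 and the convention that rows are true labels and columns are predicted labels; the function names and toy labels are illustrative and are not the GUI's own code or output.

```python
def confusion_matrix(true_labels, predicted_labels, n_labels):
    """Count predictions: matrix[t][p] = rows with true label t predicted as p."""
    matrix = [[0] * n_labels for _ in range(n_labels)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of rows on the diagonal, i.e. classified correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy labels for illustration only.
true_y = [0, 1, 2, 2, 1, 0]
pred_y = [0, 2, 2, 2, 1, 1]
cm = confusion_matrix(true_y, pred_y, 3)
acc = accuracy(cm)
```

In this toy case four of the six rows fall on the diagonal, so the accuracy is 4/6.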
- Although I will discuss the results of my analysis in later sections, it is clear from image 6 that the accuracy of the Naive Bayes analysis was very low. In an effort to improve the accuracy, I enabled my program to conduct PCA on the training set and project the test set onto the PCA-transformed space. To use this feature, the user needs to click the button 'Classify_PCA', and a dialog window (image 8) will pop up for selecting the files for the training and test datasets and the category labels. Note that this button only puts the training and test sets into the PCA-transformed space. To conduct classification, the user still needs to click the button 'Perform Classify' and follow the steps described above (using the same files for category labels).
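The idea of fitting PCA on the training set only and then projecting the test set with the same mean and components can be sketched with NumPy as follows. This is a sketch under stated assumptions, not the code behind 'Classify_PCA': the function names are hypothetical, the data is random, and no extra normalization is applied because the text does not say whether the GUI normalizes first.

```python
import numpy as np


def fit_pca(train):
    """Fit PCA on the training matrix (rows = samples, columns = features)."""
    mean = train.mean(axis=0)
    centered = train - mean
    # SVD of the centered data: rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = singular_values ** 2 / (train.shape[0] - 1)
    return mean, vt, eigenvalues


def project(data, mean, components, n_components=None):
    """Project data using the training mean and training components."""
    return (data - mean) @ components[:n_components].T


# Random stand-ins for the training and test sets.
rng = np.random.default_rng(0)
train = rng.normal(size=(50, 4))
test = rng.normal(size=(20, 4))

mean, components, eigenvalues = fit_pca(train)
train_pca = project(train, mean, components)
test_pca = project(test, mean, components)  # uses the training mean, not its own
```

Reusing the training mean and components for the test set keeps both sets in the same transformed space, which is what allows a classifier trained on `train_pca` to be evaluated on `test_pca`.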
- When projecting the analysis, please note that the number of plotted data points may be smaller than the actual number of data points. I modified my GUI so that when there are more than 200 data points, it randomly chooses 200 of them and plots only those.
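The plotting cap described above can be sketched as follows. The function name `subsample_for_plot` and the `seed` parameter are hypothetical; only the 200-point limit comes from the text.

```python
import random


def subsample_for_plot(points, max_points=200, seed=None):
    """Return all points if there are at most max_points, else a random subset."""
    if len(points) <= max_points:
        return list(points)
    # Sample without replacement so no point is plotted twice.
    return random.Random(seed).sample(points, max_points)


# Toy example: 750 points are reduced to 200; 50 points are left untouched.
shown = subsample_for_plot(list(range(750)), seed=1)
unchanged = subsample_for_plot(list(range(50)))
```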