The main purpose of this project was to incorporate principal component analysis (PCA) into my GUI. Integrating the PCA feature involved three things: 1) enabling my GUI to execute PCA on a loaded data set, 2) showing the eigenvectors and eigenvalues of a selected analysis to the user, and 3) displaying the projected data based on the analysis. To do so, I needed to complete the following tasks: a) create a function in the analysis class that conducts PCA, b) create more child classes of Dialog that let the user select features for PCA, display the results of the analysis, and offer lists of columns to plot when projecting the first three eigenvectors, and finally c) project the PCA data based on the user's selection of columns. The first task was straightforward, as it simply required following the instructions of the lab. The second one needed more planning of how to lay out the dialog window and a good amount of coding. The last task turned out to be short but still tricky; it involved slightly modifying the buildPoints() function in the display class. After completing these tasks, my GUI now has a PCA feature along with linear regression. Please refer to the instructions below to use my updated GUI.
Instructions for PCA:
Follow the instructions below to conduct PCA on the data set 'AustraliaCost.csv'. I will explain how I implemented my extensions later.
- Before performing PCA on a data set, the user must first open a file and read it by pressing 'Control-o'. The user should then select the loaded data set in the first list box of the right control frame and click the button 'Perform PCA'. If the user does not follow these steps, the error messages below (Figure 1 and Figure 2) may show up.
(Figure 1) (Figure 2)
- If the user reads in a data set and performs PCA, a dialog window with a list of the original features should show up, as in Figure 3. The user should then select the features for PCA.
- After selecting the features for PCA and clicking the 'ok' button, the user will see another dialog window (Figure 4) in which the analysis can be named. This is an extension.
- After naming the analysis, the user should see the analysis loaded in the second list box. In my case, I named the analysis 'Australia_PCA'. See Figure 5 for an illustration.
- Once PCA is executed, the user has a variety of options. Whatever the user decides to do, the analysis must first be selected in the second list box. The user can see the eigenvectors and eigenvalues of the analysis by clicking the button 'PCA Results'. Figure 6 shows the result of running PCA with the following features: premin, premax, salmin, salmax, minairtemp, maxairtemp, minsst, maxsst, minsoilmoist, maxsoilmoist, and runoffnew.
- The user can also project the first three eigenvectors in the GUI by clicking the button 'Project PCA'. When the button is clicked, a dialog window (Figure 7) will show up, in which the user can select which eigenvectors to project on which axes. As an extension, I enabled the user to pick the columns to plot and select up to five columns (x, y, z, color, size).
- After selecting the first three eigenvectors (or more), the user can then project the PCA data. I will show the plot of the Australia Coast data using the first three eigenvectors in the next section.
- As an extension, I also enabled the user to select columns intermixed from the original data and the PCA. To do so, the user should simply click the button 'Intermix PCA'. A dialog window (Figure 8) containing the original and PCA features will then show up. After making selections, the user can project the mixed data.
- Lastly, the user can store the PCA as a CSV file by clicking the button 'Save PCA' and load it later by clicking the button 'Load PCA'. When saving, the name of the analysis is used as the name of the CSV file. Note that loading a PCA will not project the data right away. It simply reloads the stored PCA into the list box so that the user can see the results or project the data again. When loading a PCA (Figure 9), the user must select the stored analysis, not the original data set.
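The computation behind the 'Perform PCA' step above can be sketched roughly as follows. This is a minimal illustration using NumPy's singular value decomposition, not the actual function in my analysis class; the function name pca, the normalize flag, and the return order are assumptions made for this example.

```python
import numpy as np


def pca(data, normalize=True):
    """Hypothetical PCA helper for an N x M array (rows are data points).

    Returns (eigenvalues, eigenvectors, means, projected). The eigenvectors
    are stored as rows, ordered from largest to smallest eigenvalue.
    Assumes N >= M so that M eigenvectors come back.
    """
    data = np.asarray(data, dtype=float)
    means = data.mean(axis=0)
    diff = data - means  # center every column on zero

    if normalize:
        # Optionally put every column on the same scale before the analysis.
        stds = data.std(axis=0, ddof=1)
        stds[stds == 0] = 1.0  # avoid dividing by zero for constant columns
        diff = diff / stds

    # Rows of vt are the eigenvectors of the covariance matrix of diff.
    _, singular_values, vt = np.linalg.svd(diff, full_matrices=False)

    # Squared singular values, scaled by N - 1, are the eigenvalues.
    eigenvalues = singular_values ** 2 / (data.shape[0] - 1)

    # Rotate the centered data into the eigenvector basis.
    projected = diff @ vt.T
    return eigenvalues, vt, means, projected
```

The variance of each projected column equals the matching eigenvalue, which is a convenient sanity check when testing a function like this.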
Required Images & Analysis:
Using the Australia Coast data set and computing the PCA on the following columns: premin, premax, salmin, salmax, minairtemp, maxairtemp, minsst, maxsst, minsoilmoist, maxsoilmoist, and runoffnew, I obtained the below plot (Figure 10) after projecting the first three eigenvectors.
For my own choice of data set, I used the Iris data set and performed PCA using the following features: sepallength, sepalwidth, petallength, and petalwidth.
- My data is homogeneous, as it consists simply of numeric measurements of the length and width of two parts of the Iris flower: the sepal and the petal.
- According to the results of PCA (Figure 11), there are two significant dimensions as the first two eigenvectors represent about 96% of the data variation.
- petallength and petalwidth are the primary contributors to the first eigenvector as they have the largest coefficients (.6161 and .6467). They are positively correlated. This result makes sense because a petal with a large length would probably also have a large width.
- Figure 12 shows the plot using the first three eigenvectors.
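The way I judge how many dimensions are significant can be sketched as a cumulative sum over the eigenvalues. The function name, the 0.9 default threshold, and the eigenvalues in the usage note are made up for illustration; they are not the Iris results from Figure 11.

```python
def significant_dimensions(eigenvalues, threshold=0.9):
    """Return (count, fractions) for eigenvalues sorted largest first.

    fractions[i] is the share of total variation captured by the first
    i + 1 eigenvectors; count is the smallest number of eigenvectors whose
    cumulative share reaches the (hypothetical) threshold.
    """
    total = sum(eigenvalues)
    cumulative = 0.0
    fractions = []
    for value in eigenvalues:
        cumulative += value
        fractions.append(cumulative / total)

    for count, fraction in enumerate(fractions, start=1):
        if fraction >= threshold:
            return count, fractions
    return len(eigenvalues), fractions
```

For example, with the invented eigenvalues [6.0, 3.0, 0.75, 0.25] and a threshold of 0.85, the first two eigenvectors are enough, because together they cover 9 of the 10 units of total variation.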
I named my program 'BAD', which stands for 'Basic Analysis of Data'.
Enable reading and writing the PCA data of an analysis as a CSV file. You will need to somehow store the eigenvectors, eigenvalues, and column averages along with the projected data. Note that there should be as many eigenvectors, eigenvalues, and means as there are columns of data.: Implementing this extension was quite similar to storing and re-plotting the linear regression in the last project. It was a bit more challenging because I did not display the PCA results using text-variables. In other words, I needed another way to obtain the PCA information besides text-variables stored as fields in the display class. My solution was to keep the PCA data objects in a list, so that I can store any analysis even if PCA is performed on multiple data sets. After obtaining the necessary information, I write the filename of the original data set, the eigenvectors, the eigenvalues, the original means, the original headers, and the projected data line by line. Figure 13 shows an example of the stored analysis of 'pcatest.csv'.
When loading a stored analysis, all that is really necessary is the filename of the original data set, the original headers, and the name of the analysis. The reason is that I create the data object and perform PCA again in the loadPCA() function of the display class.
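The storing and loading described above can be sketched with Python's csv module. The row tags ("source", "headers", and so on) and the function names store_pca and load_pca are hypothetical choices for this example and do not reproduce the exact layout shown in Figure 13. Unlike my loadPCA(), which re-runs the analysis, this sketch reads every stored value back directly.

```python
import csv
import io


def store_pca(stream, source_filename, headers, eigenvalues,
              eigenvectors, means, projected):
    """Write one analysis to an open text stream, one tagged row per line."""
    writer = csv.writer(stream)
    writer.writerow(["source", source_filename])
    writer.writerow(["headers"] + list(headers))
    writer.writerow(["eigenvalues"] + list(eigenvalues))
    writer.writerow(["means"] + list(means))
    for vector in eigenvectors:
        writer.writerow(["eigenvector"] + list(vector))
    for row in projected:
        writer.writerow(["projected"] + list(row))


def load_pca(stream):
    """Read back what store_pca wrote and return it as a dictionary."""
    result = {"source": None, "headers": [], "eigenvalues": [],
              "means": [], "eigenvectors": [], "projected": []}
    for row in csv.reader(stream):
        if not row:
            continue  # skip blank lines
        tag, values = row[0], row[1:]
        if tag == "source":
            result["source"] = values[0]
        elif tag == "headers":
            result["headers"] = values
        elif tag in ("eigenvalues", "means"):
            result[tag] = [float(v) for v in values]
        elif tag == "eigenvector":
            result["eigenvectors"].append([float(v) for v in values])
        elif tag == "projected":
            result["projected"].append([float(v) for v in values])
    return result


# Round trip through an in-memory buffer with made-up numbers.
buffer = io.StringIO()
store_pca(buffer, "pcatest.csv", ["A", "B"], [2.5, 0.5],
          [[0.6, 0.8], [-0.8, 0.6]], [1.0, 2.0], [[0.1, -0.2], [0.3, 0.4]])
buffer.seek(0)
loaded = load_pca(buffer)
```

To write a real file instead of a buffer, the stream would be opened with open(name, "w", newline=""), as the csv module documentation recommends.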
Add other features, like the ability to name an analysis.: This extension was essentially about creating a new child class of Dialog in which the user can type the name of the analysis. To do so, I created a class nameDialog in Dialog.py. I then modified my performPCA() method in display.py so that the dialog window shows up after the user selects which columns to use for PCA. I also needed to create a new field in the display class to store the names of the original data sets for the storePCA() function, because the original names are needed when reloading the analyses.
Enable the user to select up to five columns from the PCA analysis to plot (x, y, z, color, size).: After realizing that plotting the PCA analysis is basically displaying the data points of the projected data, I tried to figure out how to use my buildPoints() method for PCA. The problem was that the method needs to know whether to normalize the original data set or the projected data. My solution was to add two more parameters to the method and call it with different arguments depending on whether I am simply plotting the data or projecting the PCA data. After solving the problem, I used the modalDialog, which is also used for regular plotting, to let the user select up to five columns. Figure 14 shows the plot of the Australia Coast data set using the first five eigenvectors. (Figure 14)
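The idea of switching between the two sources can be sketched as below. This is illustrative only and is not the buildPoints() code in display.py; the names build_points, projected, and use_pca are hypothetical, and the data sets are represented as plain dictionaries that map a header to a list of values.

```python
def normalize_column(values):
    """Scale a list of numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        return [0.0 for _ in values]  # constant column: map everything to 0
    return [(v - low) / span for v in values]


def build_points(original, projected, headers, use_pca=False):
    """Return one normalized point per row for the selected headers.

    original and projected map header names to columns of numbers.
    headers lists one to five columns in the order x, y, z, color, size.
    use_pca picks which of the two sources gets normalized and plotted.
    """
    if not 1 <= len(headers) <= 5:
        raise ValueError("select between one and five columns")
    source = projected if use_pca else original
    columns = [normalize_column(source[h]) for h in headers]
    # Transpose the columns so that each entry is one data point.
    return [list(point) for point in zip(*columns)]
```

A regular plot would call build_points(original, projected, ["a", "b"]), while projecting the PCA data would pass the eigenvector headers together with use_pca=True.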
Enable the user to select up to five columns, intermixed from the original data and the PCA analysis to plot. For example, try plotting the Australia Coast data using Latitude and Longitude for the x and y spatial axes, then using the projections onto the first two eigenvectors for color and size.: Implementing this extension involved two tasks: 1) creating another dialog window that shows a list of the original headers and the PCA features, and 2) creating another normalize function in the analysis class that takes in two data objects, the original data and the PCA data. For the first task, I created mixDialog, which is very similar to the modalDialog class. I needed a new normalize function because intermixing the original data and the PCA analysis requires normalizing columns from both data objects. Figure 15 shows the plot of the Australia Coast data using Longitude and Latitude for the x and y axes and the first two eigenvectors for color and size. (Figure 15)
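A normalize function that draws columns from both data objects might look like the sketch below. It is an assumption-laden illustration rather than the function in my analysis class: normalize_mixed, the "original"/"pca" source labels, and the dictionary representation of the data objects are all hypothetical, and the values in the usage note are invented.

```python
def normalize_mixed(original, pca, selections):
    """Normalize columns picked from two data sources to [0, 1].

    original and pca map header names to columns of numbers.
    selections is a list of (source_name, header) pairs, where source_name
    is "original" or "pca". Tagging each header with its source avoids
    confusion if both objects happen to share a header name.
    """
    sources = {"original": original, "pca": pca}
    columns = []
    for source_name, header in selections:
        values = sources[source_name][header]
        low, high = min(values), max(values)
        span = (high - low) or 1.0  # constant column: avoid dividing by zero
        columns.append([(v - low) / span for v in values])

    # Both sources must describe the same rows for the mix to make sense.
    if len({len(column) for column in columns}) > 1:
        raise ValueError("selected columns have different lengths")
    return [list(row) for row in zip(*columns)]
```

For instance, selecting ("original", "Longitude"), ("original", "Latitude"), and ("pca", "PCA0") yields one row per data point with x and y taken from the original data and the third value taken from the first projected column.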
Demonstrate your system on more data sets and discuss the results.: I decided to use my own data set, 'mydata', which contains economic data about the U.S. from 1970 to 2015. I performed PCA using all the features in the data and obtained the following result (Figure 16). (Figure 16)
Based on the result, I found that three significant dimensions exist, which together represent about 91% of the data variation. Looking at the eigenvectors, I noticed that the primary contributor is real GDP, that Government Spending and Corporate Investment are correlated, and that interest rate and inflation are correlated. These correlations correspond to my analysis from the previous project using linear regression. Note that the signs of Government Spending and Corporate Investment are opposite in the third eigenvector; therefore, they must be weakly correlated. This weak correlation again corresponds to my previous analysis. Figure 17 shows the plot of mydata.csv using Government Spending and Corporate Investment for the x and y axes and the first eigenvector for size.
- Major Update on GUI: In order to organize my GUI and make it look cleaner, I made several updates to it. First, I eliminated the text-variables that I used to report the mean, std, range, and raw value because they seemed to slow down the program a bit. I also got rid of the text-variables in the right control frame, which I used to report the coefficients of the linear regression. Then I created another Dialog child class called coefficientsDialog, which displays the variables, slope, intercept, and r-squared value of the linear regression. The user only needs to click the button 'View Coefficients' to see the coefficients after running a linear regression. Figure 18 shows the coefficients from running a single linear regression on pcatest.csv. (Figure 18)
To organize the right control frame, I also removed the list box of shapes and created an OptionMenu that lets the user select the shape of the data points. I then used the grid() method more extensively to lay out the buttons more neatly. To make my program lighter, I also stopped storing the tk buttons in local variables, which saves memory. Lastly, I added labels to separate the control panel, linear regression, PCA analysis, and color legend sections, and added scroll bars to the two list boxes. See Figures 19 and 20 for the difference made in the right control frame.
(Figure 19; before) (Figure 20; after)
I received help from Professor Taylor and Maxwell in figuring out how to implement the intermix extension and interpret its result.