Using the Australia Coast data set, I computed the PCA on the following columns: premin, premax, salmin, salmax, minairtemp, maxairtemp, minsst, maxsst, minsoilmoist, maxsoilmoist, and runoffnew. Projecting the data onto the first three eigenvectors gave the plot below (Figure 10).

(Figure 10)
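
The projection step can be sketched as follows. This is a minimal illustration in plain Python, assuming the eigenvectors have already been computed and are stored one per row. The function name `project` and the toy numbers are hypothetical; they are not taken from the Australia Coast data, and any normalization applied before the PCA is left out.

```python
def project(rows, eigenvectors):
    """Center each column, then take the dot product of every
    centered row with every eigenvector."""
    n = len(rows)
    ncols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(ncols)]
    centered = [[r[j] - means[j] for j in range(ncols)] for r in rows]
    return [
        [sum(c * e for c, e in zip(row, vec)) for vec in eigenvectors]
        for row in centered
    ]


# Hypothetical toy input: three data points with two columns each.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Hypothetical eigenvectors (here simply the original axes).
eigenvectors = [[1.0, 0.0], [0.0, 1.0]]

projected = project(rows, eigenvectors)
print(projected)
```

Each inner list of `projected` holds one data point's coordinates in the eigenvector basis; passing only the first three eigenvectors yields the three values used for the x, y, and z axes.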


  • My data is homogeneous: it consists only of numeric measurements of the length and width of two parts of the Iris flower, the sepal and the petal.
  • According to the results of PCA (Figure 11), there are two significant dimensions, as the first two eigenvectors account for about 96% of the variation in the data.

     (Figure 11)
  • petallength and petalwidth are the primary contributors to the first eigenvector because they have the largest coefficients (0.6161 and 0.6467).  Both coefficients are positive, so the two features are positively correlated.  This result makes sense because a longer petal is likely to be wider as well.
  • Figure 12 shows the plot using the first three eigenvectors.

    (Figure 12)
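
The "significant dimensions" reasoning above can be sketched as a cumulative-variance calculation. This is a minimal sketch, assuming the eigenvalues are sorted in descending order. The function names, the 0.9 threshold, and the eigenvalues themselves are hypothetical and are not the values from Figure 11.

```python
def cumulative_variance(eigenvalues):
    """Return the running fraction of total variance explained."""
    total = sum(eigenvalues)
    running = 0.0
    fractions = []
    for value in eigenvalues:
        running += value
        fractions.append(running / total)
    return fractions


def significant_dimensions(eigenvalues, threshold=0.9):
    """Count how many leading eigenvectors are needed to reach threshold."""
    for count, fraction in enumerate(cumulative_variance(eigenvalues), start=1):
        if fraction >= threshold:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues, sorted from largest to smallest.
eigenvalues = [2.9, 0.9, 0.15, 0.05]

cumulative = cumulative_variance(eigenvalues)
dims = significant_dimensions(eigenvalues)
print(cumulative, dims)
```

With these illustrative numbers the first two eigenvectors together cover roughly 95% of the variance, so two dimensions pass the threshold.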


  • Enable the user to select up to five columns from the PCA analysis to plot (x, y, z, color, size).

    After realizing that plotting the PCA analysis basically means displaying the data points of the projected data, I tried to figure out how to reuse my buildPoints() method for PCA.  The problem was that the method needs to know whether to normalize the original data set or the projected data.  My solution was to add two more parameters to the method and call it with different arguments depending on whether I am plotting the original data or the projected PCA data.  After solving the problem, I then used the modalDialog, which is already used for plotting, to enable the user to select up to five columns.  Figure 14 shows the plot of the Australia Coast data set using the first five eigenvectors.
    (Figure 14)
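
The idea of reusing one point-building routine for both kinds of data can be sketched as below. This is not the actual buildPoints() method: the signature, the dictionary-of-lists data layout, and the min-max normalization are all assumptions made for illustration.

```python
def normalize_column(values):
    """Min-max scale a list of numbers to [0, 1].
    A constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def build_points(data, headers):
    """Return one tuple per data point with the chosen columns normalized.

    `data` maps a column name to its list of values. The caller decides
    whether to pass the original data or the PCA-projected data.
    """
    columns = [normalize_column(data[h]) for h in headers]
    return list(zip(*columns))


# Hypothetical original data and hypothetical projected data.
original = {"x": [0.0, 5.0, 10.0], "y": [2.0, 4.0, 6.0]}
projected = {"PCA0": [-1.0, 0.0, 1.0], "PCA1": [4.0, 4.0, 4.0]}

original_points = build_points(original, ["x", "y"])
pca_points = build_points(projected, ["PCA0", "PCA1"])
print(original_points)
print(pca_points)
```

Passing the data source as an argument keeps the normalization logic in one place, so plotting raw columns and plotting projected columns go through the same code path.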


  • Enable the user to select up to five columns, intermixed from the original data and the PCA analysis to plot. For example, try plotting the Australia Coast data using Latitude and Longitude for the x and y spatial axes, then using the projections onto the first two eigenvectors for color and size.

    Implementing this extension involved completing two tasks: 1) creating another dialog window, which shows a list of the original headers and the PCA features, and 2) creating another normalize function in the analysis class, which takes two data objects, the original data and the PCA data.  For the first task, I created mixDialog, which is very similar to the modalDialog class.  I needed a new normalize function because intermixing the original data and the PCA analysis requires normalizing columns from both data objects.  Figure 15 shows the plot of the Australia Coast data using Longitude and Latitude for the x and y axes and the projections onto the first two eigenvectors for color and size.
    (Figure 15)
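
A normalize function that draws columns from two data objects might look like the sketch below. The name `normalize_mixed`, the `(source, header)` selection format, and the sample values are hypothetical; the real analysis class may organize this differently.

```python
def normalize_mixed(original, pca, selections):
    """Min-max normalize columns picked from two data sources.

    `selections` is a list of (source, header) pairs, where source is
    either "original" or "pca". Returns one normalized list per pair.
    """
    sources = {"original": original, "pca": pca}
    result = []
    for source, header in selections:
        values = sources[source][header]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid dividing by zero on constant columns
        result.append([(v - lo) / span for v in values])
    return result


# Hypothetical values standing in for the two data objects.
original = {
    "Longitude": [110.0, 130.0, 150.0],
    "Latitude": [-40.0, -25.0, -10.0],
}
pca = {"PCA0": [-2.0, 0.0, 2.0]}

selections = [("original", "Longitude"), ("original", "Latitude"), ("pca", "PCA0")]
normalized = normalize_mixed(original, pca, selections)
print(normalized)
```

Each selected column is scaled independently, so spatial coordinates and eigenvector projections end up on the same 0 to 1 range before they are mapped to axes, color, or size.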


  • Demonstrate your system on more data sets and discuss the results.

    I decided to use my own data set, 'mydata', which contains economic data about the U.S. from 1970 to 2015.  I ran PCA using all the features in the data and obtained the following result (Figure 16).
     (Figure 16)
    Based on the result, I found three significant dimensions, which together account for about 91% of the data variation.  Looking at the eigenvectors, I noticed that the primary contributor is real GDP, that Government Spending and Corporate Investment are correlated, and that interest rate and inflation are correlated.  These correlations agree with my analysis from the previous project, which used linear regression.  Note that Government Spending and Corporate Investment have opposite signs in the third eigenvector; therefore, their correlation must be weak.  This weak correlation again agrees with my previous analysis.  Figure 17 shows the plot of mydata.csv using Government Spending and Corporate Investment for the x and y axes and the first eigenvector for size.

    (Figure 17)
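
One way to double-check a correlation suggested by eigenvector coefficients is to compute the Pearson correlation coefficient of the two columns directly. The sketch below uses made-up numbers, not values from mydata.csv, and the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Hypothetical columns: the second is exactly twice the first.
spending = [1.0, 2.0, 3.0, 4.0]
investment = [2.0, 4.0, 6.0, 8.0]

r_same = pearson(spending, investment)
r_opposite = pearson(spending, list(reversed(investment)))
print(r_same, r_opposite)
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.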


  • Major Update on GUI
    To organize my GUI and make it look cleaner, I made several updates to it.  First, I eliminated the text-variables that I used to report the mean, std, range, and raw value because they seemed to slow down the program a bit.  I also got rid of the text-variables in the right control frame, which I used to report the coefficients for the linear regression.  Then I created another Dialog child class called coefficientsDialog, which displays the variables, slope, intercept, and r-squared value of the linear regression.  After running a linear regression, the user only needs to click the 'View Coefficients' button to see the coefficients.  Figure 18 shows the coefficients from running a single linear regression on pcatest.csv.
     (Figure 18)
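
For reference, the three quantities shown in the dialog (slope, intercept, and r-squared) can be computed for a single linear regression as sketched below. This is a plain-Python illustration with hypothetical data and a hypothetical function name, not the code behind coefficientsDialog.

```python
def simple_linear_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept.
    Returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r-squared compares the residual error with the total variation in y.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical points lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept, r_squared = simple_linear_regression(xs, ys)
print(f"slope={slope:.3f} intercept={intercept:.3f} r^2={r_squared:.3f}")
```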

    To organize the right control frame, I also removed the list box of shapes and created an OptionMenu so that the user can select the shape of the data points.  I then used the grid() method more extensively to lay out the buttons more neatly.  To make my program lighter, I also stopped storing the tk buttons in local variables, which saves memory.  Lastly, I added labels to separate the control panel, linear regression, PCA analysis, and color legend sections, and I added scroll bars to the two listboxes.  Compare Figures 19 and 20 to see the difference in the right control frame.
    (Figure 19; before) (Figure 20; after)