Hieu Phan & Dan Nelson

Project summary

In this project, we implemented Principal Component Analysis (PCA) and K-means clustering. We also added these functions to our program.


To perform PCA, we wrote a function in the DataSet class called pca(). This function normalizes the input data if the user requests it, computes the covariance matrix, and then obtains the eigenvalues and eigenvectors of that matrix. numpy handles these calculations. Finally, the function sorts the eigenvectors in descending order of their eigenvalues.
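The steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pca() code: the standalone function name pca_sketch, its parameters, and the use of z-score scaling for the normalization step are all assumptions, since the report does not say how the data is normalized.

```python
import numpy as np

def pca_sketch(data, normalize=True):
    """Hypothetical stand-in for DataSet.pca().

    Returns (eigenvalues, eigenvectors, means). Eigenvectors are stored
    as rows, sorted by descending eigenvalue.
    """
    data = np.asarray(data, dtype=float)
    means = data.mean(axis=0)
    centered = data - means

    if normalize:
        # Assumption: z-score scaling; the original may scale differently.
        std = centered.std(axis=0, ddof=1)
        std[std == 0] = 1.0          # leave constant columns unscaled
        centered = centered / std

    # Covariance matrix with columns treated as variables.
    cov = np.cov(centered, rowvar=False)

    # eigh suits symmetric matrices; it returns eigenvalues in ascending
    # order and eigenvectors as columns.
    evals, evecs = np.linalg.eigh(cov)

    # Reorder so the largest eigenvalue comes first.
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order].T, means


if __name__ == "__main__":
    sample = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]])
    values, vectors, _ = pca_sketch(sample, normalize=False)
    print(values)
    print(vectors)
```

Using np.linalg.eigh rather than np.linalg.eig is a design choice in this sketch: a covariance matrix is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.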

We then project the data onto the selected eigenvectors in the function buildPCA(). The user chooses how many eigenvectors to use.
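As a rough illustration of the projection step, the sketch below centers the data and multiplies it by the chosen eigenvectors. The function name project_onto_components and its parameters are hypothetical, and taking the first k rows of the eigenvector matrix is an assumption; the internals of buildPCA() are not shown in this report. If the data was normalized before the eigenvectors were computed, the same scaling would also need to be applied here.

```python
import numpy as np

def project_onto_components(data, means, eigenvectors, k):
    """Hypothetical helper: project centered data onto the first k eigenvectors.

    data         : (n_points, n_features) array
    means        : (n_features,) column means used for centering
    eigenvectors : (n_features, n_features) array, one eigenvector per row
    k            : number of eigenvectors to keep
    """
    data = np.asarray(data, dtype=float)
    basis = np.asarray(eigenvectors, dtype=float)[:k]   # shape (k, n_features)
    # Each output column is the coordinate along one kept eigenvector.
    return (data - np.asarray(means, dtype=float)) @ basis.T


if __name__ == "__main__":
    points = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
    s = 1.0 / np.sqrt(2.0)
    vectors = np.array([[s, s], [-s, s]])   # orthonormal basis, rows are eigenvectors
    print(project_onto_components(points, points.mean(axis=0), vectors, k=1))
```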

This is the result for testdata1.csv:

Normalized data:
(The first column lists the eigenvalues in descending order. Each row then contains the components of the corresponding eigenvector.)

Unnormalized data:


Clustering for cdata.csv

Clustering for cdata4.csv

Australian Coast projected to 3 dimensional space:

Clustering for Australian Coast:

For extensions, we added a menu called Analysis to our program. From this menu, the user can run PCA and K-means clustering:

Note: To make the program process all of the data in the opened file, click Cancel in both the Filter window and the Plot settings window. For example, to run PCA on the whole Australian Coast data set, open that file, skip Filter and Settings, then go to Analysis-->PCA.

We are aware that using Cancel this way is ambiguous, and we will fix it soon.