Overview

One goal of machine learning is to cluster data points by grouping points that are similar to one another. A common clustering method is k-means, which we have implemented for our data. To get better clusters, we first ran principal components analysis (PCA) to project the data onto a smaller set of axes that capture almost all of its variance.

K-Means Algorithm

Input

Instead of having the k-means distance metric rescale the distances on every call (which would require it to keep track of the mean and standard deviation of each dimension throughout the method), our function expects pre-processed data in which every feature is on the same scale. This means that we subtract the mean and divide by the standard deviation for each feature before we send the data to be clustered. To simplify things further, the standardized data is run through principal components analysis, and the projected data in the eigenspace is what is actually clustered. Because the means returned by the function are therefore expressed in eigenspace coordinates rather than in the original features, they are not very informative. The labels, however, still correspond to the correct points, which is all that is needed to identify the cluster to which each point belongs.
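
The pipeline described above (standardize, project with PCA, then cluster) can be sketched roughly as follows. This is a minimal illustration using NumPy, not our actual implementation: the function names standardize, pca_project, and kmeans, the variance_kept threshold of 0.95, the random initialization, and the iteration cap are all assumptions made for the example.

```python
import numpy as np


def standardize(data):
    """Subtract each feature's mean and divide by its standard deviation."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # constant features stay at zero instead of dividing by zero
    return (data - mean) / std


def pca_project(data, variance_kept=0.95):
    """Project data onto the leading principal axes (threshold is an assumption)."""
    centered = data - data.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal axes.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2
    ratio = np.cumsum(explained) / explained.sum()
    # Smallest number of axes whose cumulative variance reaches the threshold.
    n_keep = min(int(np.searchsorted(ratio, variance_kept)) + 1, vt.shape[0])
    return centered @ vt[:n_keep].T


def kmeans(data, k, n_iter=100, seed=0):
    """Plain Lloyd's iteration; returns (means, labels)."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct data points chosen at random.
    means = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every mean.
        dists = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_means = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):
            break
        means = new_means
    # Recompute labels so they always match the final means.
    dists = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return means, dists.argmin(axis=1)
```

With raw data in an array raw, the call order would be kmeans(pca_project(standardize(raw)), k). The returned means live in the projected space, while labels[i] still refers to row i of raw.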

Visualizations

Clusters are categorical, so it is intuitive to visualize them with color. We made "cluster" an option for the color "axis"; when it is selected, additional radio buttons appear that let the user choose how many clusters to display (2-7).
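
One way to turn cluster labels into colors is a fixed lookup from label to color. The sketch below is only an illustration of that idea: the PALETTE values and the colors_for_labels helper are hypothetical, and the palette has seven entries only because seven is the largest cluster count offered.

```python
# Hypothetical palette: one distinct color per cluster, enough for up to 7 clusters.
PALETTE = [
    "#e41a1c", "#377eb8", "#4daf4a", "#984ea3",
    "#ff7f00", "#a65628", "#f781bf",
]


def colors_for_labels(labels, k):
    """Map integer cluster labels in [0, k) to one color string per point."""
    if not 2 <= k <= len(PALETTE):
        raise ValueError("k must be between 2 and 7")
    colors = []
    for label in labels:
        label = int(label)
        if not 0 <= label < k:
            raise ValueError("label out of range for k clusters")
        colors.append(PALETTE[label])
    return colors
```

Because the labels are categorical, the particular colors carry no meaning; only whether two points share a color does.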