
Summary

The goal of this project was to implement and visualize a clustering algorithm, which finds groups of points in data.  The algorithm that we used is kmeans, which requires the user to pick how many clusters to look for.  Kmeans can either be run with a library function (such as SciPy's scipy.cluster.vq.kmeans) or written from scratch: start with a chosen number of means, assign each point to the mean it is closest to, recalculate the mean of each group, and repeat until a maximum number of iterations has passed or the change in the means falls below some threshold.  Once the algorithm had been tested, the categories are plotted in different colors with their means in a distinct color.  I also added variations: different distance metrics (different ways to measure the distance to a mean) and another clustering algorithm (online clustering, which starts with one mean and adds more whenever a point is more than some threshold away from every existing mean).  Together this creates a reasonably robust clustering application.

Description

To implement the kmeans algorithm, I used a combination of 4 functions. 

kmeans_init takes in data, a number of clusters K, and optionally a column of categories.  If categories are given, I loop through the list of points, keeping track of the sum and the number of data points in each category, then compute each mean by dividing that category's sum by its number of points.  If no categories are given, I make a copy of the data, loop K times, and select a random data point to use as a mean each time.  In either case, I return a matrix of the means.
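In plain numpy (ignoring my Data object), kmeans_init boils down to roughly this sketch:

```python
import numpy as np

def kmeans_init(data, K, categories=None):
    """Pick K initial means: category averages if labels are given,
    otherwise K distinct random data points (simplified sketch)."""
    data = np.asarray(data, dtype=float)
    if categories is not None:
        categories = np.asarray(categories).flatten()
        # one mean per category: the average (sum / count) of that category's points
        means = np.array([data[categories == c].mean(axis=0)
                          for c in np.unique(categories)])
    else:
        # choose K distinct random rows of the data as the starting means
        idx = np.random.choice(data.shape[0], size=K, replace=False)
        means = data[idx].copy()
    return means   # my actual code wraps this in a numpy matrix
```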

kmeans_classify: takes in data and means, loops over each point and each mean, and determines which mean the point is closest to using the L2 distance metric.  It then returns a column matrix of the category for each data point and a column matrix of the error (the distance to its mean) for each data point.
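A sketch of the same idea, vectorized with numpy broadcasting instead of the explicit double loop:

```python
import numpy as np

def kmeans_classify(data, means):
    """Assign each point to its closest mean under the L2 metric; return a
    column of category indices and a column of distances (sketch)."""
    data = np.asarray(data, dtype=float)
    means = np.asarray(means, dtype=float)
    # distance from every point to every mean: shape (N points, K means)
    dists = np.sqrt(((data[:, np.newaxis, :] - means[np.newaxis, :, :]) ** 2).sum(axis=2))
    categories = dists.argmin(axis=1)                       # index of the closest mean
    errors = dists[np.arange(data.shape[0]), categories]    # distance to that mean
    return categories.reshape(-1, 1), errors.reshape(-1, 1)
```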

kmeans_algorithm: this was given.  It takes in data and means, and loops for a maximum number of iterations or until the means stop changing very much: it finds categories using the current means, recalculates each mean, and repeats.  It then returns a matrix of means, a matrix of categories, and a matrix of errors.
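In outline (a simplified sketch using the kmeans_classify sketch above; the provided version also handles details I have left out), the loop is:

```python
import numpy as np

MAX_ITERATIONS = 100

def kmeans_algorithm(data, means, min_change=1e-7):
    """Alternate classify / re-average until the means stop moving (sketch)."""
    data = np.asarray(data, dtype=float)
    means = np.asarray(means, dtype=float).copy()
    K = means.shape[0]
    for _ in range(MAX_ITERATIONS):
        categories, errors = kmeans_classify(data, means)
        new_means = means.copy()
        for k in range(K):
            members = data[categories.flatten() == k]
            if members.shape[0] > 0:        # keep the old mean for an empty cluster
                new_means[k] = members.mean(axis=0)
        if np.abs(new_means - means).sum() < min_change:
            means = new_means
            break
        means = new_means
    categories, errors = kmeans_classify(data, means)
    return means, categories, errors
```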

kmeans: this takes in a data object, a list of headers, a number of clusters, and optionally whether to whiten the data (useful if the columns have different units).  It selects the columns of the headers, whitens them if requested, calls kmeans_init and then kmeans_algorithm, and returns the results of kmeans_algorithm.
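Ignoring my Data object and header selection, the wrapper reduces to roughly this (the headers parameter is unused in the sketch because the columns are passed in directly):

```python
import numpy as np

def kmeans(data, headers, K, whiten=True):
    """Top-level kmeans: optionally whiten the columns, then run
    init + the iterative algorithm (sketch)."""
    A = np.asarray(data, dtype=float)    # stand-in for selecting the header columns
    if whiten:
        stds = A.std(axis=0)
        stds[stds == 0] = 1.0            # avoid dividing a constant column by zero
        A = A / stds                     # whiten: divide each column by its std dev
    means = kmeans_init(A, K)
    means, categories, errors = kmeans_algorithm(A, means)
    return means, categories, errors
```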

1) The first task was to use UCI activity recognition data to test my kmeans algorithm.  There are two data sets - one that provides the actual activity recognition data and the other that provides a column of potential categories for each data point.  Here is an image of my output:

This file executes PCA, then uses the 32 eigenvectors that account for 90% of the variation in the data set.  The confusion matrix takes the points that we know fall into each category and shows which cluster they end up in.  If the clustering worked perfectly, each activity would fall entirely within a single cluster.  This clustering worked well, but not perfectly.
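The confusion matrix itself is just a table of counts; a minimal sketch of how it could be built from the known activity labels and the cluster assignments:

```python
import numpy as np

def confusion_matrix(true_labels, cluster_ids):
    """Rows are the known activities, columns are the clusters; entry (i, j)
    counts the points of activity i that ended up in cluster j (sketch)."""
    true_labels = np.asarray(true_labels).flatten()
    cluster_ids = np.asarray(cluster_ids).flatten()
    activities = np.unique(true_labels)
    clusters = np.unique(cluster_ids)
    matrix = np.zeros((activities.size, clusters.size), dtype=int)
    for i, a in enumerate(activities):
        for j, c in enumerate(clusters):
            matrix[i, j] = np.sum((true_labels == a) & (cluster_ids == c))
    return matrix
```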

To make this work better, I increased the number of eigenvectors used by keeping a higher percentage of the variation.  Here are images of the results - I started by increasing to 95%, then to 99%, then to 100%:

By the time the percentage of variation kept reaches 100%, almost all of the clusters are perfectly represented.

2) Once I knew that the kmeans algorithm was working, the next task was to incorporate it into the display by showing the clusters in different colors. To do this, I first made a command, called clustering, that pulls up a dialog box allowing the user to select which columns of data to use in clustering:

This dialog box gives the user several choices: first the headers to include in the analysis, then which type of clustering to use (I will explain online clustering later).  If the user selects kmeans, they must also select a distance metric (explained later; the base project uses L2), whether or not to whiten the data before analysis, and how many clusters to use.

This then calls a handlePCA function that uses the input information to call the kmeans function.  It adds the column of categories to the data object, using the analysis name as a header (cluster0, etc.), which allows the user to pull up the analysis later.  Then it adds the cluster analysis to a list box on the right side of the screen.

If the user selects this clustering:

Then a dialog box appears to let the user pick which headers to display, and whether to use preselected colors or a smooth range from a light to a dark color (which allows for more clusters):

If the box isn't checked, here is an example using the Australia data with 10 clusters:

And this is another example, this time with preselected colors:

To create these displays, up to 4 headers are used for the x, y, z, and size axes and are plotted the same way as any other data file.  Then, to create the colors: if preselected colors are chosen and there are fewer than 11 clusters, the category number from the cluster0 column of the data is used as an index into the list of preselected colors.  Otherwise, the cluster0 column is normalized and the normalized values are mapped to colors the same way as when coloring ordinary plotted data.  Either way, every point in a category is given the same color.
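A sketch of the color selection logic (the PRESELECTED list and the light-to-dark ramp here are placeholders; my actual color choices differ):

```python
import numpy as np

# hypothetical list of up to 10 preselected, easy-to-distinguish colors
PRESELECTED = ["red", "blue", "green", "orange", "purple",
               "brown", "cyan", "magenta", "yellow", "black"]

def cluster_colors(cluster_ids, preselected=True):
    """Map each point's cluster id to a color: indexed colors when there are
    10 or fewer clusters, otherwise a smooth light-to-dark ramp (sketch)."""
    ids = np.asarray(cluster_ids).flatten()
    n_clusters = np.unique(ids).size
    if preselected and n_clusters <= 10:
        return [PRESELECTED[int(i)] for i in ids]
    # smooth ramp: normalize the ids to [0, 1] and shade along a single hue
    norm = (ids - ids.min()) / max(ids.max() - ids.min(), 1)
    return [(0.2 + 0.6 * t, 0.2, 0.8 - 0.6 * t) for t in norm]   # RGB tuples
```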

3) I have shown some examples above, but using the Australia Coast data set, I ran the kmeans clustering using all the headers, whitening the data, and using 10 clusters.  Here is the result, using preselected colors, when I plot the latitude, the longitude, and the standard deviation of the elevation depth:

4) The fourth task was to use a data set of our choice and demonstrate a characteristic of that data set.  I chose a synthetic data set with clear clusters to demonstrate how the sum squared error changes as the number of clusters changes.  I got the data set from https://cs.joensuu.fi/sipu/datasets/, which says that it has 15 clusters.  Here is an image of the plot without clustering:

I ran the kmeans clustering algorithm for 1-10 clusters and kept track of the representation error using an L2 distance metric.   Here are images of each analysis:

I also kept track of the sum of squared distances (SSD) compared to the number of clusters, made a CSV, and plotted it.  Here is the result:

Following the idea that the ideal number of clusters is at the elbow, where the curve starts to flatten out, I would judge that the ideal number of clusters to balance the representation error against the model cost is actually somewhere around 9.  It would be better to group some of the clusters together than to spend the extra model cost of storing more means.
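The loop that generated the error-versus-clusters data looks roughly like this, reusing the kmeans sketch above (the CSV filename is just a placeholder):

```python
import csv
import numpy as np

def elbow_curve(data, max_k=10):
    """Run kmeans for k = 1..max_k and record the summed squared L2 error,
    so the 'elbow' can be read off a plot of the resulting CSV (sketch)."""
    rows = []
    for k in range(1, max_k + 1):
        means, categories, errors = kmeans(data, headers=None, K=k, whiten=False)
        ssd = float((np.asarray(errors) ** 2).sum())   # sum of squared distances
        rows.append((k, ssd))
    with open("clusters_vs_ssd.csv", "w", newline="") as f:
        csv.writer(f).writerows([("clusters", "ssd")] + rows)
    return rows
```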

Extensions

1) The first extension that I completed was to add the ability to choose and use different distance metrics for kmeans clustering.  As shown above, the dialog box that lets the user select which headers to include in the clustering also lets them select the distance metric to use in the kmeans algorithm.  L1 uses the sum of the absolute values of the differences between the features of a data point and the mean to determine the distance to that mean.  This is an image of the clusters that result from this distance metric:

The total error in this case is 2.48. 

The L2 metric is the default distance metric - the square root of the sum of the squared differences.  This is an image of the clusters using this distance metric:

The total error in this case is 1.08.

The L infinity metric uses the maximum difference between the features of two data points.  This is an image of the clusters using the L infinity metric:

The error is 0.77.

The correlation metric is 1 minus the numpy correlation coefficient between the two data points.  Here is an image:

The representation error is 0.00005. 

I considered also including Hamming distance, but since that is usually used for strings, it would not work very well in this case.
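In sketch form, the distance between a point x and a mean m under each of the four metrics is:

```python
import numpy as np

def distance(x, m, metric="L2"):
    """Distance between one data point x and one mean m under the
    selectable metrics (sketch)."""
    x, m = np.asarray(x, dtype=float), np.asarray(m, dtype=float)
    if metric == "L1":
        return np.abs(x - m).sum()                 # sum of absolute differences
    if metric == "L2":
        return np.sqrt(((x - m) ** 2).sum())       # Euclidean distance
    if metric == "Linf":
        return np.abs(x - m).max()                 # largest single difference
    if metric == "correlation":
        return 1.0 - np.corrcoef(x, m)[0, 1]       # 1 minus the correlation coefficient
    raise ValueError("unknown metric: %s" % metric)
```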

It is interesting to note that the representation errors are very different in each case: each metric is in different units, so the raw errors cannot be compared directly.  The clusterings can be compared by looking at where the groupings occurred, though, and it seems that the L1, L2, and correlation metrics give very similar results, while the L infinity norm produces a different clustering.

This shows the importance of knowing which distance metric to pick, and why it is important to let the user choose: their results will depend on this choice.

2) I completed the second extension by adding to the dialog box that lets the user pick the parameters of the clustering algorithm.  As stated in the previous extension, I now allow them to pick a distance metric, and I also let them choose whether to use kmeans clustering or online clustering.  While there are benefits to both algorithms, online clustering helps the user get at least a ballpark figure for how many clusters there are, which they can either use as their final clustering or feed back in as the number of clusters for kmeans.  Adding online clustering greatly increases the usefulness of the application because the user doesn't need to know anything about the data set before they try to create clusters.

3) The third extension that I completed was to allow the user to run online clustering on projected data from a PCA analysis. To do this, I added a cluster checkbox to the dialog box that lets the user select the headers for a PCA analysis:

If the cluster box is checked, the online algorithm is run on the projected data.  When the user selects the analysis from the listbox, it is plotted:

It identified 8 clusters when running the same PCA analysis from last week. 

4) The fourth extension that I completed was to implement another clustering method, online clustering, which has been discussed above.  This algorithm selects the desired headers, whitens them if necessary, and starts with the first point in the data as the first mean.  It also finds the standard deviation of each column to be used later.  Then it loops over every point, and for each point loops over every mean, finding the Euclidean distance to that mean and keeping track of the minimum distance so far.  At the end, if the minimum distance is greater than the input threshold, that point becomes a new mean; otherwise, it joins the group of the closest mean.  The function returns a list of means, a matrix of categories, and a matrix of errors.
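A simplified sketch of the online algorithm (it skips the header selection, whitening, and the column standard deviations mentioned above):

```python
import numpy as np

def online_cluster(data, threshold):
    """Single-pass (online) clustering: a point farther than `threshold`
    from every existing mean starts a new cluster (sketch)."""
    data = np.asarray(data, dtype=float)
    means = [data[0].copy()]                       # first point is the first mean
    categories = np.zeros((data.shape[0], 1), dtype=int)
    errors = np.zeros((data.shape[0], 1))
    for i, point in enumerate(data):
        dists = [np.sqrt(((point - m) ** 2).sum()) for m in means]
        best = int(np.argmin(dists))
        if dists[best] > threshold:
            means.append(point.copy())             # too far from everything: new mean
            best, dists = len(means) - 1, dists + [0.0]
        categories[i, 0] = best
        errors[i, 0] = dists[best]
    return means, categories, errors
```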

To implement this algorithm, I added the option to the dialog box as explained above:

When online clustering is chosen, the analysis is saved to the Clustering listbox.  When that analysis is then selected from the listbox, it is plotted:

This indicates that there are 8 clusters, which is very close to the given 10. 

5) The fifth extension that I completed was to add labels for the means: 

The means are light pink with a black outline so that they stand out from the rest of the data. Note that this only happens if the data has not been whitened.  For whitened data to work, I would need to unwhiten the means: whitening divides each column by its standard deviation, so I would need to multiply the means by the original column standard deviations before normalizing them to be plotted.
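In code, the unwhitening step would be roughly:

```python
import numpy as np

def unwhiten_means(whitened_means, original_data):
    """Undo whitening: whitening divided each column by its standard
    deviation, so multiply the means back by those deviations (sketch)."""
    std = np.asarray(original_data, dtype=float).std(axis=0)
    return np.asarray(whitened_means, dtype=float) * std
```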

What I Learned:
I learned a lot this week about how to find clusters in data.  The most interesting part for me was exploring the distance metrics, because I learned about them in my math classes but have never seen them in a practical setting.  It was also interesting to explore how the color scheme changes the ability to view and distinguish between the clusters.

Thanks to:
Stephanie, Bruce
