This project was all about k-means clustering, which means we created a way for the user to analyze data by grouping the points into however many categories they want. The algorithm assigns each data point to the cluster of its closest mean, then recalculates each mean from the points in its cluster. This repeats until the means stop changing by more than a small threshold, or until a provided maximum number of iterations is reached. When known categories are available, the results are more accurate because we can initialize the cluster means from the category means instead of making the algorithm pick its own starting points. My program lets the user do all of this by first opening a file, then choosing kmeans from the command menu. The user chooses which columns to include in the algorithm's input data and how many clusters to make. The result is written to a new csv file named originalfilename-#clusters.csv, where "#" is the number of clusters. Lastly, the user can open that file and plot it normally, except that a checkbox now appears that lets the user choose whether to color code clusters with a smooth gradient or with distinct colors. I tested my program on my wages data from project 5.
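The assign-and-recalculate loop described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code from my program: the function names, the random initialization, and the `min_change` threshold are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100, min_change=1e-7, seed=0):
    """Cluster `points` (a list of numeric tuples) into k groups.

    Returns (means, cluster_ids). Names and defaults are hypothetical.
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    means = [list(p) for p in rng.sample(points, k)]
    ids = [0] * len(points)

    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its closest mean.
        ids = [
            min(range(k), key=lambda c: squared_distance(p, means[c]))
            for p in points
        ]

        # Update step: recompute each mean from its current members.
        new_means = []
        for c in range(k):
            members = [p for p, i in zip(points, ids) if i == c]
            if members:
                new_means.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # An empty cluster simply keeps its previous mean.
                new_means.append(means[c])

        # Stop early once the means have essentially stopped moving.
        change = sum(squared_distance(a, b) for a, b in zip(means, new_means))
        means = new_means
        if change < min_change:
            break

    return means, ids
```

Calling `kmeans(points, 3)` on a list of (year, women's wage, men's wage) tuples would return three means and one cluster id per row, which is the kind of column that gets written out to the new csv file.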
Task 1 was to test our k-means classifiers on the provided large dataset of activities (walking, running, jumping, etc.).
What gave me some trouble on step 4 was implementing the color options. When plotting k-means data, the assumption is that the user wants the default cluster ids to determine color, but if they specify a color column, that overrides the default. There is also a checkbox for choosing between a yellow-to-blue gradient and distinct colors. This checkbox only appears for k-means data, which I detect by checking whether there is a column called "cluster". I ended up making a new color method, distinctColorFormula, which uses the colorsys python module to map the input number to an hsv tuple and then converts that to an rgb tuple. I also multiply each value in that tuple by 255, because colorsys works on a scale of 0 to 1. An important part of creating distinct colors is taking into account how many clusters there are, so that the colors can be spread out across the available hues.
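As an illustration of the hsv-to-rgb idea, here is a minimal sketch of how a distinct-color mapping could look. It is not my actual distinctColorFormula: the name `distinct_color`, its parameters, and the choice to space hues evenly at full saturation are assumptions for the example. Only `colorsys.hsv_to_rgb` comes from the standard library.

```python
import colorsys


def distinct_color(cluster_id, num_clusters):
    """Map a cluster id in [0, num_clusters) to an (r, g, b) tuple of ints in 0-255.

    Hypothetical helper for illustration only.
    """
    # Spread the hues evenly around the color wheel, one slot per cluster.
    hue = cluster_id / num_clusters
    # Full saturation and value give the most vivid version of each hue.
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    # colorsys returns floats in [0, 1], so scale each component to 0-255.
    return tuple(int(round(channel * 255)) for channel in (r, g, b))
```

Dividing by the number of clusters is what keeps the colors apart: with 3 clusters the hues are a third of the wheel away from each other, while with 10 clusters they are closer together and harder to tell apart.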
Task 5 was to plot the cluster data for the Australia data set (10 clusters). This is my result. The points are fairly well clustered, but the light brown points are scattered around. Some lighter purple points are also mixed into New Zealand, indicating that those were not all successfully grouped together either. The more clusters we make, the less accurate the results seem to be.
For the last task, I clustered the data from project 5 about wages. I made 3 clusters; you can see them in blue, red, and green below. Since the data was clustered on year, women's wages, and men's wages, with all three dimensions taken into account, it seems to be divided into sections where the slope is consistent. This shows that there is a sharper increase in men's wages than in women's. As the second picture below shows, men's wages start off with a higher average but tend to even out with women's as wages rise, suggesting that wages become more balanced the higher they get. This is why the points have been clustered along the slight changes in slope.
This project was helpful for learning how to code clustering algorithms. It was a challenge to incorporate colors, but I learned more about how to manipulate tuples through my new method. I also learned that my way of incorporating PCA is not very efficient: I couldn't complete the extension of running k-means on PCA data because my code checks whether it has normal or PCA data in so many places. I ended up not doing it because I would have had to restructure a lot of things, which I didn't have time for. It all seems to lead back to the way I get the headers differently depending on what kind of data is being used.
Thanks to: Stephanie, probably Melody