The main data structure for this week's project was a priority queue, which is essentially a max-heap. Max-heap can be understood as a binary tree, which always maintains the highest number (according to a comparator) at its root. Therefore, max-heap was appropriate to use for this project because I attempt to find N number of most common words from a word count file. If a max-heap's root always holds the highest number, I simply need to remove and return the root to obtain the most common words. With the priority queue, or max-heap, created from the lab, I wrote FindCommonWords class to find the N most common words. It uses WordCounter class's methods to analyze the Reddit comment text files, builds a map, such as BSTMap or Hashmap, to store the analysis, and then stores the data of map in a max-heap. FindTrends class, which does not need a priority queue, uses WordCounter class to analyze the Reddit comment files (data structure for WordCounter class is BSTMap). Then, it calculates the frequency of specific words and prints them out. I will report the results of using FindCommonWords and FindTrends class below,
FindCommonWords class first uses WordCounter class's analyze() method to build a map, which is Hashmap in this class. Based on the data of the map, it builds a max-heap. Thanks to the nature of the heap, I then simply remove the root of the heap to obtain the most common words. In order to obtain N number of the most common words, I used for loop.
This method is the core of this class. It takes in two parameters: 1) name of the text file and 2) number of the most common words to find. I first use WordCounter's analyze() method to build a data structure of (key, value) pairs of the text files. Then I retrieve the data structure using WordCounter's entrySet() method. I added this new method to WordCounter class. From the data structure, I obtain the (key, value) pairs by using its entrySet() method. Finally, I store the pairs in the heap and then remove the root of the heap by N numer of times.
I enabled user to provide three arguments to the command line: 1) Number of the most common words to find, 2) Number of the text files to analyze, and 3) Which method to use: HEAP or List. I will explain the third argument in the extensions section. If one writes 10, 1, HEAP on the command line, FIndCommonWords() class will find 10 most common words of 2008 Reddit comment file using a heap. Below is an output of the example. Note that the decimals represent the frequency of each word.
After analyzing all of the Reddit comment files, I found that there are only slight differences between each file's most frequent words. In general, the most frequent words were: the, to, a, I, of, is, you, it, and, that. Intuitively, this is a resonable result because these words are essential to the writings of English. Below is the table to individual output of each year's comment file.
As mentioned in the introduction, FindTrends class does not need a heap. It simply uses WordCounter class to build a data structure, which is BSTMap in this class, based on its analyze() method. It needs the data structure, or map, in order to find a frequency of specific words. The list of words is provided through the command line argument. I used Java's builtin arrayList to store the list of words and then used for loop to find a frequency of each word. Thus, the user must provide a list of words on the command line in order to run this class. Assuming that the user would want to analyze the trend across all text files, I did not enable the user to control the number of text files to analyze.
In order to test whether my FindTrends class works, I ran it with the following argument: snapchat, uber, tesla, microsoft, apple, yahoo and obtained the below graph. Because this graph is identical with the one on this week's project webpage, I concluded that my FindTrends class works appropriately.
The output from the terminal may look like below. The decimals represent the frequency of each word.
For Task 3, I used the following argument: clinton, sanders, rubio, trump, obama, cruz, palin. Below is the graph of the results.
It should not be surprising that Obama was mentioned the most in comments in 2008 since he was elected that year. Clinton was mentioned the second most because she was the second runner to the democrat's primary. However, other politicians' trends are not clear because the two politicians dominate the overall trend. Therefore, I observed another graph without Obama and Clinton.
Romney was mentioned the most in 2012 because he ran against Obama in the re-election of 2012. Berney Sanders's frequency starts to hike up in 2014. Perhaps the reason is that he started to be mentioned more by people as a potential candidate for the Democrat's primary. We can also observe that Sarah Palin's frequency reached a peak in 2010, which is a year later she resigned as the governor of Alaska. Regarding Trump, he was mentioned the most in 2011 since he considered running against Obama at the election of 2012.
- Use more than one list of interesting words and report the trends, including an analysis of the trends and what might explain them.
Given the popularity of sports in the U.S, I wanted to analyze the frequency of major sports, such as soccer, football, baseball, basketball, Olympics, and world cup. I included the last two so that I could observe Americans' interest in international sports events.
Not surprisingly, football is mentioned the most throughout the 8 years that I analyze. Baseball is the second most mentioned sports, which is also not too surprising. What surprised me was that soccer and basketball were essentially mentioned by similar times. The frequency of the Olympics clearly follows the year it is held, which was 2008, 2010, 2012, and 2014 (both summer and winter Olympics). Wordcup, which is barely mentioned, shows a slight increase in 2010 and 2014 when it was held.
Interested in Economics, I conducted another analysis with the following words: recession, bankruptcy, fed, unemployment, and deflation. Below is the graph.
We can observe that all of the words were mentioned quite frequently in 2008 because of the Recession that occurred. The frequency of most words tends to decline as the economy starts to recover from the recession of 2008. However, the frequency of bankruptcy and unemployment reach a peak at the year of 2009 and 2011, respectively, when the risk of the recession still existed. Particularly, the eurozone risk might have made people use the word 'unemployment' more often in 2011.
- An alternative method of identifying the top N words is to read them into an ArrayList and sort them. Do a time comparison of this method with using the heap.
For this extension, I first created another comparator KVListComparator, which uses Integer's compareTo() method. I then used the Collection class's sort() method along with the new comparator as a parameter. The list in the snippet is the arrayList of (key, value) pairs, generated by WordCounter class's analyze() method.
When analyzing all of the 8 comment files, it took 377 seconds using the heap. It took 427 seconds using the ArrayList and sort() method. Heap method is faster because it is O(log(N)) while ArrayList method takes O(N*log(N)).
- Additional Analysis
In order to exploit the priority queue, or max-heap, I created FindLongWords class, which finds the longest word of each file, by using FindCommonWords class as the base. I wrote KVStringComparator in order to sort the heap by the length of each word, rather than the frequency of each word. At first, I expected to observe long and complicated words. However, the result turned out to be vastly different from my expectation. The longest words were not complicated like 'pneumonoultramicroscopicsilicovolcanoconiosis'. They were rather repetition of a simple word. Below is the graph which shows the length of the longest word of each comment file. Please type in a number of text files to read on the command line argument.
The longest length was 10000 character-long. I did not include the words because they were too long for the graph. From this analysis, I learned that there are people who decide to type 'no' 3000 times, 'haha' 1000 times, or 'lol' nearly 3000 times. Some outputs were merely a combination of random numbers that were 5000 character-long. Online comments can be indeed interesting.
- User Control
For FindCommonWords class, I enabled a user to decide the number of the word, the number of text files, and which method to use. The first two arguments are easy to implement as they can be parameters of the calculate() method I created. For the third argument, I created a public enumeration Method in the class.
Within the calculate() method, I added two conditional statements so that the method would use two different methods depending on the user's input. Note that user must provide methods in capitalization: HEAP or LIST.
In order to run FindCommonWords class, user must provide the following arguments: 1) number of words, 2) number of text files, and 3) which method. The text file will be analyzed, starting from 2008 by default.
I learned how to implement the priority queue, max-heap, and how it can be useful when analyzing the trends of data. Moreover, I observed that heap is faster than the arrayList to sort.
For this project, I received help from Professor Maxwell.