The main data structure for this week's project was a priority queue, implemented as a max-heap. A max-heap can be understood as a binary tree that always keeps the largest element (according to a comparator) at its root. This made it a good fit for the project, whose goal is to find the N most common words in a word count file: because the root always holds the largest element, I only need to remove and return the root repeatedly to obtain the most common words. Using the priority queue (max-heap) from the lab, I wrote the FindCommonWords class to find the N most common words. It uses the WordCounter class's methods to analyze the Reddit comment text files, builds a map (such as a BSTMap or Hashmap) to store the analysis, and then stores the map's data in a max-heap. The FindTrends class, which does not need a priority queue, uses the WordCounter class (backed by a BSTMap) to analyze the Reddit comment files. It then calculates the frequency of specific words and prints them out. I report the results of using the FindCommonWords and FindTrends classes below.
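To make the heap property concrete, here is a minimal sketch of an array-backed max-heap driven by a comparator. It is not the class from the lab: the name MaxHeapSketch and its add(), remove(), and size() methods are hypothetical, chosen only to illustrate how the largest element stays at the root.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration, not the lab's priority queue class.
class MaxHeapSketch<T> {
    private final List<T> items = new ArrayList<>();
    private final Comparator<T> comparator;

    MaxHeapSketch(Comparator<T> comparator) {
        this.comparator = comparator;
    }

    int size() {
        return items.size();
    }

    // Append at the end, then move the new element up while it beats its parent.
    void add(T value) {
        items.add(value);
        int child = items.size() - 1;
        while (child > 0) {
            int parent = (child - 1) / 2;
            if (comparator.compare(items.get(child), items.get(parent)) <= 0) {
                break;
            }
            swap(child, parent);
            child = parent;
        }
    }

    // Remove and return the root (the largest element), or null if the heap is empty.
    T remove() {
        if (items.isEmpty()) {
            return null;
        }
        T root = items.get(0);
        T last = items.remove(items.size() - 1);
        if (!items.isEmpty()) {
            // Put the last element at the root and move it down past larger children.
            items.set(0, last);
            int parent = 0;
            while (true) {
                int left = 2 * parent + 1;
                int right = left + 1;
                int largest = parent;
                if (left < items.size()
                        && comparator.compare(items.get(left), items.get(largest)) > 0) {
                    largest = left;
                }
                if (right < items.size()
                        && comparator.compare(items.get(right), items.get(largest)) > 0) {
                    largest = right;
                }
                if (largest == parent) {
                    break;
                }
                swap(parent, largest);
                parent = largest;
            }
        }
        return root;
    }

    private void swap(int i, int j) {
        T tmp = items.get(i);
        items.set(i, items.get(j));
        items.set(j, tmp);
    }

    public static void main(String[] args) {
        MaxHeapSketch<Integer> heap = new MaxHeapSketch<>(Comparator.<Integer>naturalOrder());
        for (int value : new int[] {3, 9, 1, 7}) {
            heap.add(value);
        }
        // Elements come out from largest to smallest.
        while (heap.size() > 0) {
            System.out.println(heap.remove());
        }
    }
}
```

Both add() and remove() touch at most one node per level of the tree, which is why repeatedly removing the root is a cheap way to read off the largest elements.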
The FindCommonWords class first uses the WordCounter class's analyze() method to build a map, which is a Hashmap in this class. From the map's data, it builds a max-heap. Because the heap always keeps the most frequent remaining word at its root, I simply remove the root to obtain the next most common word. To obtain the N most common words, I repeat this removal in a for loop.
This method is the core of this class. It takes two parameters: 1) the name of the text file and 2) the number of most common words to find. I first use WordCounter's analyze() method to build a data structure of (key, value) pairs from the text file. Then I retrieve the data structure through WordCounter's entrySet() method, which I added to the WordCounter class. From the data structure, I obtain the (key, value) pairs by using its entrySet() method. Finally, I store the pairs in the heap and remove the root of the heap N times.
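The steps above can be sketched as follows. This is only an illustration under stated assumptions: java.util.HashMap and java.util.PriorityQueue stand in for the project's own Hashmap and heap classes, and the class name TopWordsSketch and method name topN are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch; the library classes stand in for the project's own map and heap.
class TopWordsSketch {
    // Returns up to n (word, count) pairs, most frequent first.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        // Reversing the by-count comparator turns Java's min-heap into a max-heap.
        Comparator<Map.Entry<String, Integer>> byCount = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(byCount.reversed());

        // Store every (key, value) pair in the heap.
        heap.addAll(counts.entrySet());

        // Remove the root n times (or until the heap runs out).
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        for (int i = 0; i < n && !heap.isEmpty(); i++) {
            result.add(heap.poll());
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy counts, not data from the Reddit files.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 50);
        counts.put("heap", 12);
        counts.put("word", 31);
        for (Map.Entry<String, Integer> entry : topN(counts, 2)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Words with equal counts come out in no particular order in this sketch, since the comparator looks only at the count.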
I enabled the user to provide three arguments on the command line: 1) the number of most common words to find, 2) the number of text files to analyze, and 3) which method to use: HEAP or List. I will explain the third argument in the extensions section. If one writes 10, 1, HEAP on the command line, the FindCommonWords class will find the 10 most common words of the 2008 Reddit comment file using a heap. Below is the output of this example. Note that the decimals represent the frequency of each word.
As mentioned in the introduction, the FindTrends class does not need a heap. It simply uses the WordCounter class's analyze() method to build a data structure, which is a BSTMap in this class. It needs this map in order to find the frequency of specific words. The list of words is provided through the command line arguments. I used Java's built-in ArrayList to store the list of words and then used a for loop to find the frequency of each word. Thus, the user must provide a list of words on the command line in order to run this class. Assuming that the user would want to analyze the trend across all text files, I did not enable the user to control the number of text files to analyze.
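The lookup loop described above might look like the sketch below. It makes several assumptions: java.util.TreeMap stands in for the project's BSTMap, frequency is taken to mean a word's count divided by the total word count, and the names TrendSketch and frequency are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; TreeMap stands in for the project's BSTMap.
class TrendSketch {
    // Assumed definition: count of the word divided by the total number of words.
    // Returns 0.0 for a word that never appears or for an empty file.
    static double frequency(Map<String, Integer> counts, int totalWords, String word) {
        if (totalWords <= 0) {
            return 0.0;
        }
        return (double) counts.getOrDefault(word, 0) / totalWords;
    }

    public static void main(String[] args) {
        // Toy counts for a single file, not data from the Reddit files.
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("obama", 30);
        counts.put("clinton", 20);
        counts.put("the", 950);
        int totalWords = 1000;

        // The words of interest come from the command line.
        List<String> words = new ArrayList<>(Arrays.asList(args));
        for (String word : words) {
            System.out.println(word + " " + frequency(counts, totalWords, word));
        }
    }
}
```

Running the same loop once per yearly file would give one frequency per word per year, which is the shape of data a trend graph needs.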
For Task 3, I used the following arguments: clinton, sanders, rubio, trump, obama, cruz, palin. Below is the graph of the results.
It should not be surprising that Obama was mentioned the most in comments in 2008, since he was elected that year. Clinton was mentioned the second most because she was the runner-up in the Democratic primary. However, the other politicians' trends are not clear because these two politicians dominate the overall trend. Therefore, I plotted another graph without Obama and Clinton.
Romney was mentioned the most in 2012 because he ran against Obama in the 2012 election. Bernie Sanders's frequency starts to climb in 2014. Perhaps the reason is that people started to mention him more as a potential candidate for the Democratic primary. We can also observe that Sarah Palin's frequency reached a peak in 2010, a year after she resigned as the governor of Alaska. Trump was mentioned the most in 2011, since he considered running against Obama in the 2012 election.
- Use more than one list of interesting words and report the trends, including an analysis of the trends and what might explain them.
In order to run the FindCommonWords class, the user must provide the following arguments: 1) the number of words, 2) the number of text files, and 3) which method to use. The text files are analyzed starting from 2008 by default.
I learned how to implement a priority queue as a max-heap and how it can be useful when analyzing trends in data. Moreover, I observed that the heap was faster than the ArrayList at producing the words in sorted order.