The main data structure for this week's project was a priority queue, implemented as a max-heap.  A max-heap can be understood as a binary tree that always keeps the highest element (according to a comparator) at its root.  A max-heap was therefore appropriate for this project, since the goal is to find the N most common words in a word count file: if the root always holds the word with the highest count, I simply need to remove and return the root N times to obtain the most common words.  With the priority queue, or max-heap, created in the lab, I wrote the FindCommonWords class to find the N most common words.  It uses the WordCounter class's methods to analyze the Reddit comment text files, builds a map (such as a BSTMap or HashMap) to store the results, and then stores the map's entries in a max-heap.  The FindTrends class, which does not need a priority queue, uses the WordCounter class (backed by a BSTMap) to analyze the Reddit comment files, then calculates the frequency of specific words and prints them out.  I report the results of FindCommonWords and FindTrends below.
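The max-heap behavior described above can be illustrated with a small sketch.  The lab's own priority queue implementation is not reproduced here, so Java's built-in `java.util.PriorityQueue` with a reversed comparator stands in for it; the class and method names below are illustrative only.

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Minimal illustration of the max-heap property: with a reversed comparator,
// the root of the PriorityQueue is always the largest element.
public class MaxHeapDemo {

    // Removing the root returns the current maximum.
    public static int popMax(PriorityQueue<Integer> heap) {
        return heap.poll();
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> heap =
                new PriorityQueue<>(Comparator.reverseOrder());
        heap.addAll(List.of(3, 17, 5, 9));
        System.out.println(popMax(heap)); // 17: the root holds the maximum
        System.out.println(popMax(heap)); // 9: the next-largest moves to the root
    }
}
```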

Main Code:


The FindCommonWords class first uses the WordCounter class's analyze() method to build a map, which is a HashMap in this class.  From the map's entries it builds a max-heap.  Thanks to the heap property, I then simply remove the root of the heap to obtain the most common word.  To obtain the N most common words, I used a for loop.


This method is the core of this class.  It takes two parameters: 1) the name of the text file and 2) the number of most common words to find.  I first use WordCounter's analyze() method to build a data structure of (key, value) pairs for the text file.  Then I retrieve those pairs using WordCounter's entrySet() method, a new method I added to the WordCounter class.  Finally, I store the pairs in the heap and remove the root of the heap N times.
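The pipeline above (map entries into a heap, then remove the root N times) can be sketched as follows.  Since WordCounter.analyze() is not reproduced here, a precomputed (word, count) map stands in for its result, and the class and method names are assumptions, not the report's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch of the core method: build a max-heap from (word, count) entries,
// then poll the root n times, printing each word with its frequency
// (count divided by the total word count, as in the output below).
public class CommonWordsSketch {

    public static List<String> findCommonWords(Map<String, Integer> counts, int n) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();

        // Max-heap keyed on count: the most common word is always at the root.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(
                Map.Entry.<String, Integer>comparingByValue().reversed());
        heap.addAll(counts.entrySet());

        List<String> lines = new ArrayList<>();
        for (int i = 0; i < n && !heap.isEmpty(); i++) {
            Map.Entry<String, Integer> e = heap.poll(); // remove the root
            lines.add(e.getKey() + " " + (double) e.getValue() / total);
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("to", 6, "a", 3, "of", 1);
        findCommonWords(counts, 2).forEach(System.out::println);
    }
}
```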


Main Method

I enabled the user to provide three arguments on the command line: 1) the number of most common words to find, 2) the number of text files to analyze, and 3) which method to use: HEAP or LIST.  I will explain the third argument in the extensions section.  For example, running with 10, 1, HEAP makes the FindCommonWords class find the 10 most common words of the 2008 Reddit comment file using a heap.  Below is an example of the output.  Note that the decimals represent the frequency of each word.
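The argument handling described above might look like the following sketch.  The actual FindCommonWords main method is not reproduced in this report, so the names below are illustrative assumptions.

```java
// Hypothetical sketch of parsing the three command-line arguments:
// number of words, number of files, and the method (HEAP or LIST).
public class ArgsSketch {

    public static String describe(String[] args) {
        int numWords = Integer.parseInt(args[0]); // how many common words to find
        int numFiles = Integer.parseInt(args[1]); // how many yearly files to analyze
        String method = args[2];                  // "HEAP" or "LIST"
        return "find " + numWords + " words in " + numFiles
                + " file(s) using " + method;
    }

    public static void main(String[] args) {
        System.out.println(describe(new String[] {"10", "1", "HEAP"}));
    }
}
```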


 to 0.02625999    to 0.02563286    to 0.0263746     to 0.02528071
 a 0.0231918      a 0.02428888     a 0.02393183     a 0.0242744
 of 0.02008783    i 0.02411792     and 0.0201435    i 0.02354834
 and 0.01943367   and 0.02046803   i 0.01994381     and 0.0204501
 i 0.0171554      of 0.01703906    of 0.01968899    of 0.01675602
 that 0.01602533  you 0.01567275   that 0.01555128  you 0.0150105
 is 0.01575751    is 0.01362519    is 0.01502971    it 0.01413919
 in 0.01331729    it 0.01421955    you 0.0143402    is 0.01329668
 you 0.01308683   that 0.01407504  it 0.01375306    in 0.01233809
 to 0.02624088    to 0.02516254    to 0.02594358    to 0.02507664
 a 0.02440211     a 0.02388506     a 0.0245522      a 0.0235001
 i 0.02237575     i 0.02248902     i 0.02369495     i 0.02156968
 and 0.020426     and 0.02042753   and 0.02045069   and 0.02032402
 of 0.01827029    of 0.01632116    of 0.01771945    of 0.01599683
 you 0.0154782    you 0.01490447   you 0.01532769   you 0.01456904
 that 0.0146614   it 0.01378817    that 0.01440568  it 0.01336479
 it 0.01413483    is 0.01326602    it 0.01409198    is 0.01304905
 is 0.01405073    that 0.01312599  in 0.01275602    that 0.01282318


As mentioned in the introduction, the FindTrends class does not need a heap.  It simply uses the WordCounter class's analyze() method to build a data structure, which is a BSTMap in this class.  It needs this map in order to find the frequency of specific words.  The list of words is provided through command-line arguments, so the user must provide a list of words on the command line to run this class.  I used Java's built-in ArrayList to store the list of words and then used a for loop to find the frequency of each word.  Assuming that the user would want to analyze the trend across all text files, I did not let the user control the number of text files to analyze.
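The FindTrends lookup described above amounts to one map query per requested word.  A minimal sketch, assuming a precomputed (word, count) map in place of WordCounter.analyze() and a hypothetical class name:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the FindTrends idea: for each word in the user's list,
// look up its count in the map and divide by the total word count.
public class TrendsSketch {

    // Returns the frequency of each query word (0.0 if the word is absent).
    public static List<Double> frequencies(Map<String, Integer> counts,
                                           List<String> words) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        List<Double> freqs = new ArrayList<>();
        for (String w : words) { // one lookup per requested word
            freqs.add(counts.getOrDefault(w, 0) / (double) total);
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("obama", 6, "clinton", 2, "the", 12);
        System.out.println(frequencies(counts, List.of("obama", "palin")));
    }
}
```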


For Task 3, I used the following arguments: clinton, sanders, rubio, trump, obama, cruz, palin.  Below is a graph of the results.


It should not be surprising that Obama was mentioned the most in 2008 comments, since he was elected that year.  Clinton was mentioned the second most because she was the runner-up in the Democratic primary.  However, the other politicians' trends are hard to see because these two dominate the overall trend, so I plotted another graph without Obama and Clinton.


Romney was mentioned the most in 2012 because he ran against Obama in the 2012 re-election.  Bernie Sanders's frequency starts to climb in 2014, perhaps because people began mentioning him as a potential candidate for the Democratic primary.  We can also observe that Sarah Palin's frequency peaked in 2010, a year after she resigned as governor of Alaska.  As for Trump, he was mentioned the most in 2011, when he considered running against Obama in the 2012 election.


  • Use more than one list of interesting words and report the trends, including an analysis of the trends and what might explain them.


In order to run the FindCommonWords class, the user must provide the following arguments: 1) the number of words, 2) the number of text files, and 3) which method.  By default, the text files are analyzed starting from 2008.


I learned how to implement a priority queue as a max-heap and how it can be useful when analyzing trends in data.  Moreover, I observed that the heap approach is faster than sorting an ArrayList.
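The heap-versus-list comparison can be sketched as follows: both strategies find the n largest counts, but the list approach sorts the entire collection, while the heap only needs the inserts plus n root removals.  The class and method names are illustrative, not the project's actual code, and Java's built-in PriorityQueue stands in for the lab's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Two ways to pick the n largest counts: via a max-heap (insert all,
// remove the root n times) and via fully sorting a list copy.
public class HeapVsListSketch {

    public static List<Integer> topViaHeap(List<Integer> counts, int n) {
        PriorityQueue<Integer> heap =
                new PriorityQueue<>(Comparator.reverseOrder());
        heap.addAll(counts);                                // inserts
        List<Integer> top = new ArrayList<>();
        for (int i = 0; i < n; i++) top.add(heap.poll());   // n root removals
        return top;
    }

    public static List<Integer> topViaList(List<Integer> counts, int n) {
        List<Integer> copy = new ArrayList<>(counts);
        copy.sort(Comparator.reverseOrder()); // sorts the whole list
        return copy.subList(0, n);
    }

    public static void main(String[] args) {
        List<Integer> counts = List.of(4, 19, 7, 2, 11);
        System.out.println(topViaHeap(counts, 2)); // [19, 11]
        System.out.println(topViaList(counts, 2)); // [19, 11]
    }
}
```

Both methods agree on the result; the difference the report observed is in running time, since the list variant pays for ordering every element even when only the top n are needed.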