The goal of this project was to find the most common words in a text document of Reddit comments and to find trends in how often a set of words appeared over 8 years of Reddit comments. To find the most common words, the program starts from the word count files produced in Project 7 with a binary search tree; each file lists every word and how often it appeared. These counts are read into a binary search tree again and then put into a priority queue. The priority queue removes items in order of how many times each word appears, and the list is printed to the terminal. In an extension, the word-count pairs are read directly into a priority queue. To find trends, the program reads the word-count pairs into a binary search tree, then finds each word from a list, computes its frequency (how many times it appears divided by the total number of words in the document), and prints the list of frequencies to the terminal. A graph is then made from the results. In further extensions, the results are written to files rather than printed to the terminal.
1) The first task was to create a FindCommonWords class that prints the words in a word count file (from Project 7) in order from the word that occurs most often to the word that occurs least often. It has two fields: a WordCounter object (from Project 7) and a priority queue heap. The heap uses a comparator that takes two key-value pairs, each holding a string (the word) and an integer (the count), and returns the difference between the counts to determine which word occurs more often.
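Here is a minimal sketch of how a count comparator and a heap can work together to order words from most to least common. It is not my project code: it assumes Java's built-in PriorityQueue and Map.Entry as stand-ins for the project's own heap and key-value pair classes, and the class name, method name, and counts are made up for illustration. It uses Integer.compare, which has the same sign as the difference between the counts but cannot overflow.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class CommonWordsSketch {
    // Ranks word-count pairs by count: a positive result means the first
    // pair occurs more often than the second.
    static final Comparator<Map.Entry<String, Integer>> BY_COUNT =
        (a, b) -> Integer.compare(a.getValue(), b.getValue());

    // Returns the pairs ordered from highest count to lowest by draining a heap.
    static List<Map.Entry<String, Integer>> mostCommonFirst(
            List<Map.Entry<String, Integer>> pairs) {
        // java.util.PriorityQueue is a min-heap, so the comparator is reversed
        // to make the largest count come out first.
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(BY_COUNT.reversed());
        heap.addAll(pairs);

        List<Map.Entry<String, Integer>> ordered = new ArrayList<>();
        while (!heap.isEmpty()) {
            ordered.add(heap.poll()); // removes the current most common word
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Illustrative counts only, not taken from the Reddit data.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("cat", 1));
        pairs.add(new SimpleEntry<>("the", 3));
        pairs.add(new SimpleEntry<>("a", 2));

        for (Map.Entry<String, Integer> pair : mostCommonFirst(pairs)) {
            System.out.println(pair.getKey() + " " + pair.getValue());
        }
    }
}
```

With the made-up counts above, the words come out as "the", then "a", then "cat".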
In findWords, I used the readWordCountFile method of WordCounter from Project 7 to put all of the word-count pairs into a binary search tree. Then, using the getPairs method, I made an arrayList of the word-count pairs. I looped over every pair in the arrayList and added each one to the priority queue heap. Finally, the method loops once for each item in the heap and prints the word-count pair returned each time an item is removed. This prints every word in the word count file in order from the word that occurs most to the word that occurs least. Here is an image of the ten most common words in the 2008 Reddit comments file:
"the" is by far the most common word, almost twice as common as even the second most common word.
"deleted" is the first word in the list that is not a pronoun, article, auxiliary verb, or linking verb. It is the 27th most common word, occurring 74,414 times.
2) The second task was to create a FindTrends class to "determine the frequency of a specific set of words in a set of words". It has the same fields as the FindCommonWords class.
I made a findTrends method that starts by clearing the WordCounter object, in case more than one list of words is analyzed with the same FindTrends object. Then it calls the readWordCountFile method of the WordCounter object. It makes an arrayList of Doubles called frequency and puts 0.0 in each location, up to the size of the words arrayList passed in to the method. Then, for each word in the words arrayList, it calculates the frequency by calling the getFrequency method of the WordCounter object and stores the result in the frequency arrayList at the same index the word has in the words arrayList. At the end it returns the frequency arrayList.
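The frequency calculation can be sketched roughly as follows. This is a simplified stand-in, not my actual findTrends: it assumes a plain Map of counts (a TreeMap here, in place of the Project 7 binary search tree) and a total word count passed in directly, and the class and method names are hypothetical. It also appends each frequency to the list instead of pre-filling with 0.0 and setting by index, which gives the same result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FrequencySketch {
    // Returns, for each word in `words`, its count divided by the total
    // number of words in the document. Words that never appear get 0.0.
    static List<Double> findFrequencies(Map<String, Integer> counts,
                                        int totalWords,
                                        List<String> words) {
        List<Double> frequency = new ArrayList<>();
        for (String word : words) {
            int count = counts.getOrDefault(word, 0);
            // Guard against an empty document to avoid dividing by zero.
            double f = (totalWords == 0) ? 0.0 : (double) count / totalWords;
            frequency.add(f);
        }
        return frequency;
    }

    public static void main(String[] args) {
        // Illustrative counts only, not taken from the Reddit data.
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("cat", 2);
        counts.put("dog", 1);
        counts.put("the", 1);

        List<String> words = Arrays.asList("cat", "fish");
        System.out.println(findFrequencies(counts, 4, words)); // prints [0.5, 0.0]
    }
}
```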
Then I wrote a secondary helper method called findTrend that takes an array of strings (the args array) and an arrayList of words. It starts by making an arrayList of strings, the frequency of each word by year, filled with one empty string for each word in the words arrayList. Then, looping from the year at index 1 of the args array up to and including the year at index 2 of the args array, it calls findTrends from above on the current year's file (argsi.txt). For each year it loops over the returned arrayList and, at the corresponding index of the frequency-by-year arrayList, appends that year's frequency to the string already stored there. At the end, I print the arrayList.
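A rough sketch of this year-by-year accumulation is shown below. It is an illustration under assumptions rather than my actual findTrend: the per-year frequencies are supplied in an in-memory map instead of being read from the year files, the start and end years are passed as ints instead of coming from the args array, and the names and numbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TrendByYearSketch {
    // Builds one string per word holding that word's frequency for each year
    // from startYear to endYear (inclusive), separated by spaces.
    static List<String> trendByYear(Map<Integer, Map<String, Double>> freqByYear,
                                    int startYear, int endYear,
                                    List<String> words) {
        // Start with one empty string per word.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            rows.add("");
        }
        for (int year = startYear; year <= endYear; year++) {
            Map<String, Double> freqs = freqByYear.getOrDefault(
                year, Collections.<String, Double>emptyMap());
            for (int i = 0; i < words.size(); i++) {
                double f = freqs.getOrDefault(words.get(i), 0.0);
                // Append this year's frequency to the word's running string.
                rows.set(i, rows.get(i) + f + " ");
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Illustrative frequencies only, not taken from the Reddit data.
        Map<Integer, Map<String, Double>> freqByYear = new HashMap<>();
        Map<String, Double> y2008 = new HashMap<>();
        y2008.put("cat", 0.5);
        Map<String, Double> y2009 = new HashMap<>();
        y2009.put("cat", 0.25);
        freqByYear.put(2008, y2008);
        freqByYear.put(2009, y2009);

        List<String> words = Arrays.asList("cat", "dog");
        List<String> rows = trendByYear(freqByYear, 2008, 2009, words);
        for (int i = 0; i < words.size(); i++) {
            System.out.println(words.get(i) + ": " + rows.get(i));
        }
    }
}
```

Keeping one string per word makes each row easy to paste into a spreadsheet, with one column per year.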
I passed in a list of animals (cat, dog, hamster, fish, rabbit, frog, ferret, hedgehog, turtle, gecko, lizard, alpaca). Here is the terminal output with the frequencies from the Reddit comments in 2008-2015:
It shows the animals in just 2008, then in 2008-2015.
I copied and pasted this into an Excel document, added a header, then made a graph of the results: