The goal of this project was to find the most common words in a text file of Reddit comments and to track how the frequency of a chosen set of words changed over 8 years of Reddit comments. The input is a set of word count files, produced in Project 7 with a binary search tree, that list each word and how often it appeared. To find the most common words, the program reads a count file back into a binary search tree and then puts the word-count pairs into a priority queue. The priority queue removes the pairs in order of how many times each word appears, and the program prints the list to the terminal. In an extension, it reads the word-count pairs directly into a priority queue. To find trends, it reads the word-count pairs into a binary search tree, then finds each word from a list, computes its frequency (how many times it appears divided by the total number of words in the document), and prints the results to the terminal. A graph is then made from the results. In extensions, it writes the results to files rather than printing to the terminal. The most common words found are pronouns, articles, and very simple verbs like "is" and "have". "Cat" and "dog" are the most common pet words over all 8 years tested.
1) The first task was to create a FindCommonWords class that prints the words in a word count file (from Project 7) in order from most to least frequent. It has fields for a WordCounter object (from Project 7) and a priority queue heap. The heap uses a comparator that takes two key-value pairs, each holding a String (the word) and an Integer (the count), and returns the difference between the counts to determine which word occurs more often.
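A minimal sketch of the comparator and heap described above. It is not the project's code: it assumes java.util.PriorityQueue stands in for the project's own priority queue heap, Map.Entry stands in for the key-value pair class, and the names CommonWordsSketch, BY_COUNT_DESC, and wordsByCount are hypothetical. It also uses Integer.compare rather than a raw difference of the counts, which gives the same ordering for word counts.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical stand-in for FindCommonWords; not the project's actual class.
class CommonWordsSketch {
    // Orders pairs so that the pair with the larger count is removed first.
    static final Comparator<Map.Entry<String, Integer>> BY_COUNT_DESC =
        (a, b) -> Integer.compare(b.getValue(), a.getValue());

    // Adds every pair to the heap, then removes them one by one,
    // which yields the words from most to least frequent.
    static List<String> wordsByCount(List<Map.Entry<String, Integer>> pairs) {
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(BY_COUNT_DESC);
        heap.addAll(pairs);
        List<String> ordered = new ArrayList<>();
        while (!heap.isEmpty()) {
            Map.Entry<String, Integer> top = heap.poll();
            ordered.add(top.getKey() + " " + top.getValue());
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only, not values from the Reddit files.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("cat", 1));
        pairs.add(new SimpleEntry<>("the", 5));
        pairs.add(new SimpleEntry<>("i", 3));
        for (String line : wordsByCount(pairs)) {
            System.out.println(line);
        }
    }
}
```

Words with equal counts may come out in either order, since the comparator only looks at the count.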
In findWords, I used the readWordCountFile method of WordCounter from Project 7 to put all of the word-count pairs into a binary search tree. Then, using the getPairs method, I made an ArrayList of the word-count pairs, looped over every pair in the ArrayList, and added each one to the priority queue heap. Finally, the method loops once for each item in the heap and prints the word-count pair returned by each removal. This prints every word in the word count file in order from the word that occurs most to the word that occurs least. Here is an image of the ten most common words in the 2008 Reddit comments file:
"the" is by far the most common word, almost twice as common as the second most common word.
"deleted" is the first word in the list that is not a pronoun, article, or auxiliary or linking verb. It is the 27th most common word, occurring 74,414 times.
2) The second task was to create a FindTrends class to "determine the frequency of a specific set of words in a set of words". It has the same fields as the FindCommonWords class.
I made a findTrends method that starts by clearing the WordCounter object in case more than one list of words is analyzed with the same FindTrends object. Then it calls the readWordCountFile method of the WordCounter object. It makes an ArrayList of Doubles called frequency and puts 0.0 in each position, up to the size of the words ArrayList passed in to the method. Then, for each word in the words ArrayList, it gets the word's frequency by calling the getFrequency method of the WordCounter object and stores it in the frequency ArrayList at the same index the word has in the words ArrayList. At the end it returns the frequency ArrayList.
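The frequency step above can be sketched as follows. This is an illustration under assumptions, not the project's code: a TreeMap stands in for the WordCounter's binary search tree, the total word count is passed in as a parameter, a word missing from the map is treated as having a count of 0, and the names FrequencySketch and frequencies are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the findTrends method; not the project's actual class.
class FrequencySketch {
    // Returns, for each word, its count divided by the total number of words.
    static List<Double> frequencies(Map<String, Integer> counts, int totalWords, List<String> words) {
        // Start with 0.0 in every position, one per word.
        List<Double> frequency = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            frequency.add(0.0);
        }
        // Overwrite each position with that word's frequency.
        for (int i = 0; i < words.size(); i++) {
            int count = counts.getOrDefault(words.get(i), 0); // assumption: missing word counts as 0
            frequency.set(i, (double) count / totalWords);
        }
        return frequency;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only.
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("cat", 2);
        counts.put("dog", 6);
        System.out.println(frequencies(counts, 8, Arrays.asList("dog", "cat", "frog")));
    }
}
```

Casting the count to double before dividing matters here, because dividing two ints in Java would truncate every frequency to 0.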
Then I wrote a helper method called findTrend that takes an array of Strings (the args array) and an ArrayList of words. It starts by making an ArrayList of Strings, the frequency of each word by year, filled with one empty string for each word in the words ArrayList. Then, looping from the year at the 1st index of the args array up to and including the year at the 2nd index, it calls findTrends from above on the current year's file (argsi.txt). It then loops over the returned ArrayList and sets the corresponding entry in the frequency-by-year list to its current value plus the frequency at the same index of the findTrends ArrayList. At the end, I print the ArrayList.
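A rough sketch of the year loop described above. The names TrendSketch and trendRows are hypothetical, the per-year lookup is passed in as a function so the sketch does not need the real count files, and the comma separator between years is an assumption (the text only says each frequency is appended to the word's string).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical stand-in for the findTrend helper; not the project's actual class.
class TrendSketch {
    // Builds one string per word holding that word's frequency for each year in order.
    // frequenciesForYear plays the role of calling findTrends on one year's file.
    static List<String> trendRows(int firstYear, int lastYear, int wordCount,
                                  IntFunction<List<Double>> frequenciesForYear) {
        // One empty string per word to start.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < wordCount; i++) {
            rows.add("");
        }
        // The loop includes lastYear, matching "up to and including" in the text.
        for (int year = firstYear; year <= lastYear; year++) {
            List<Double> freqs = frequenciesForYear.apply(year);
            for (int i = 0; i < wordCount; i++) {
                String separator = rows.get(i).isEmpty() ? "" : ","; // assumed separator
                rows.set(i, rows.get(i) + separator + freqs.get(i));
            }
        }
        return rows;
    }
}
```

Keeping one string per word means each row already has the shape of a line in a spreadsheet, with one column per year.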
I put in a list of animals (cat, dog, hamster, fish, rabbit, frog, ferret, hedgehog, turtle, gecko, lizard, alpaca). Here is the terminal output with the frequencies from the reddit comments in 2008-2015:
It shows the animals in just 2008, then in 2008-2015.
I copied and pasted this into an Excel document, added a header, then made a graph of the results:
Dog, cat, and fish are by far the most common animal names used throughout all 8 years. The three words are closest in 2008, then vary but generally grow apart until 2015. Dog and cat become generally more frequent between 2008 and 2015, while fish remains constant. Cat almost catches up to dog in 2009, but not quite. The rest of the words are all much less common and remain constant in frequency over the 8 years.
1) The first extension that I completed was to move the findCommonWords output to a text file rather than printing to the terminal, so it is easier to look through and understand. To do this, I made a new method called writeFindWordsFile. It starts by clearing the WordCounter object, then calls readWordCountFile of the WordCounter object. It makes an ArrayList of the pairs, then adds each to the priority queue heap. Then I declared a BufferedWriter and, in a try block, created the BufferedWriter, looped once for each item in the priority queue heap, and wrote the key-value pair returned by removing from the heap. The method then catches exceptions and closes the file when necessary. Here is the text file made using this method:
(I tried to upload the file but it's too big)
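The file-writing step from this extension can be sketched roughly as below. This is not the project's writeFindWordsFile: it assumes java.util.PriorityQueue and Map.Entry in place of the project's heap and pair classes, uses try-with-resources instead of an explicit catch-and-close, writes to any java.io.Writer so it can be tried without touching the disk, and the names WriteWordsSketch and writeInOrder are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical stand-in for writeFindWordsFile; not the project's actual class.
class WriteWordsSketch {
    // Removes pairs from the heap in priority order and writes one "word count" line per pair.
    static void writeInOrder(PriorityQueue<Map.Entry<String, Integer>> heap, Writer destination) {
        // try-with-resources flushes and closes the writer even if a write fails.
        try (BufferedWriter writer = new BufferedWriter(destination)) {
            while (!heap.isEmpty()) {
                Map.Entry<String, Integer> pair = heap.poll();
                writer.write(pair.getKey() + " " + pair.getValue());
                writer.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing a java.io.FileWriter as the destination would send the same lines to a text file.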
2) The second extension that I completed was to try to make the findWords process faster. I made another method, also called findWords, that takes an integer parameter to distinguish it from the first. It starts by declaring a String and a BufferedReader. In a try block, it creates the BufferedReader and uses it to read the first line of the file (to skip the header). Then it reads each remaining line of the word counts file and parses it, making the two pieces into a KeyValuePair object and adding each KeyValuePair to the priority queue heap. Then it handles exceptions and closes the file. This findWords prints each pair as it is removed. writeFindWordsNewFile uses the same direct-read approach but writes the removed pairs to a text file, just as the previous extension did. I made sure that the file created was the same as above:
Then, to see which worked best, I ran them on the same file (2008) and kept track of how long it took to do each:
The two options actually take approximately the same amount of time, so not much was saved by adding KeyValuePairs directly into the priority queue heap.
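For reference, the direct-read approach from this extension might look something like the sketch below. It is only an illustration: the exact layout of the word count file is not shown here, so the sketch assumes one header line followed by lines of the form "word count" separated by whitespace. It also uses java.util.PriorityQueue and Map.Entry in place of the project's heap and KeyValuePair classes, and the names DirectReadSketch and readIntoHeap are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical stand-in for the second findWords method; not the project's actual class.
class DirectReadSketch {
    // Reads "word count" lines straight into a heap, skipping the binary search tree.
    static PriorityQueue<Map.Entry<String, Integer>> readIntoHeap(Reader source) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        try (BufferedReader reader = new BufferedReader(source)) {
            reader.readLine(); // skip the header line (assumed to be exactly one line)
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+"); // assumed whitespace-separated format
                if (parts.length < 2) {
                    continue; // ignore blank or malformed lines
                }
                heap.add(new SimpleEntry<>(parts[0], Integer.parseInt(parts[1])));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return heap;
    }
}
```

Passing a java.io.FileReader for one year's count file would fill the heap from that file.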
3) The third extension that I completed was to write the results of findTrend to a CSV file so they can be opened in a spreadsheet. I edited findTrend so that at the end it initializes a BufferedWriter to null. In a try block, I created the BufferedWriter and wrote a header by writing the numbers in the range from the first index in the args array up to and including the second index of the args array, separated by commas. This makes a header with the list of years for which the frequency information was collected. Then I write every item in the words list followed by the corresponding string in the frequency of words by year. This makes a full CSV file with a header line that can be turned into a graph. It looks like this:
4) The fourth extension that I completed was to test FindTrends on another set of words. I decided to use a list of words related to sexuality to see how their frequency has changed over the 8 years. I used gay, lesbian, transgender, intersex, bisexual, bi, pansexual, queer and asexual. Here is an image of the CSV file (in Excel) and the corresponding graph:
Gay is by far the most common word used, although its frequency has decreased since 2012. I wonder if this is because there has been a lot of talk of eliminating "gay" as an insult or because people now use other words to discuss sexuality. Lesbian is the only other word to generally decrease in frequency over the 8 years, and only very slightly. The rest of the words increase in frequency slightly, possibly because there is more knowledge and use of these words in modern culture.
5) The fifth extension I completed was to make a findWords text file for each of the Reddit comment count files (one for each year from 2008 to 2015). Interestingly, they all had the same top 10 words, with "the" and "i" always being the top two and the rest occasionally switching positions.