In this project, we made a program that reads in all of the words in a text file and writes a new file containing the total number of words and the times each word appeared. We started off with a key and value pair class, and would add those (using the word as the key and the times appeared as the value) into a binary search tree class. Using a lot of recursion, we sorted, searched for, and otherwise manipulated the nodes in the binary search tree. At the end, we read in Reddit comments files and observed how many words each year had along with the number of actual unique words.
For the first task, we wrote the first and only class within the scope of the project, the word counter class. It has a field for the word count, the map it is storing the words in, and the comparator it uses to compare strings. We wrote a method to read in the lines of a text file, using java's BufferedReader and FileReader. Then, we printed out what was in the tree to check if the tree's ordering system was correct. I wrote a method in my BSTMap class called fancyPrinting to handle this. It recursively added to a string that I returned at the end, with all of the nodes in preorder. This recursion and the other methods' recursion was done by using the node's left and right fields. From the start, we had one "root" node, which we would call a method on, then since it and all other nodes had a left and right node associated with it (unless it was a leaf), we could put the left and right of any given node into a method argument or call a method on them. Below is a picture of how my recursion code looked in general:
For example, to find the size of the tree, I start off with an integer, then check how many other nodes there are. If there are none, there is just one node so I return 1 as the size. Otherwise, I check if there is a left node first (which would need to happen for pre-order, although in this method, it isn't relevant), and if there is, I keep getting the size. This happens until all nodes are checked, and when they are, they return 1 over and over, adding to the initial size integer from the first traversal.
On a related note, to make the spacing more tabbed in the deeper you go in the fancyPrinting method, I checked how deep the tree was at every point in the recursion. This was done with a getDepth method.
Task 2 was to make a method in the counter class to write the info about words to a text file. I used the BufferedWriter and FileWriter to write into a file. I first made a new file, then wrote the total word count at the top of the file. Then I looped through the arraylist of pairs obtained from the getPairs method in the map, and printed a key and value (word and count) on each new line. I had a problem where all the counts were "1", which I fixed by using .equals() when comparing the key string instead of ==, which used the strings' indices, apparently.
Below is a picture of how I wrote into a file. Note that I kept track of how many words had been read so I could print every so often, useful for checking progress with the Reddit files later. I make a new FileWriter, then I use that through the BufferWriter. I have to flush and close the file to make the words show up properly.
Task 3 was to make a method that would read the file with the word counts in, and make a new map out of that. Like the original reading method, I scanned the lines in a while loop that advanced while the newly-read line was not null. After reading the line once and doing nothing with it, I skipped the first line. Next, the way I distinguished the word from the count was to check if the word count was even or odd (only the word and count lines were added). If it was reading in an odd-count word, that would be the key, and the second one would be the count. I added each kind to its respective arraylist storing all of them, created in this method. At the end of the file loop, I looped through the lists and on each iteration, added the current loop value index of the key list to the map. I looped through the count of each line and updated the frequency of the word each time.
For task 4, we had to test our methods all in a row. First, I read in the original file. Then, I wrote the count file for it. Next, I read in the count file, and used the method from task 3 to build a new map. Lastly, I wrote a redo count file from that. Realizing the point of this task, I added in code at the beginning of most of the methods that reset the count to 0 and the map to a new one. This is required so we can start off with a clear map for everything we read in. I had a method that did all of these things called reconstruct.
In task 5, we had to read in all of the Reddit text files and make count files for them. This was the hardest part of the project, because at this point I hadn't been using recursion when I wasn't explicitly asked to, instead using for loops in many methods of the BSTMap class. I had to go back and change everything to recursion. The for loop takes a very long time and is therefore inefficient when processing a large amount of data like this. After doing that, I still took twice as long to process files as expected, and Stephanie helped me. To fix this, I changed my getCount method in my counter class so it only got the count once, stored it in a variable, and referred to that when needed, instead of creating it twice to "save lines" like I wanted to do... And I also wrote directly to the file instead of storing everything in a huge string, then writing that all in.
Task is was to analyze our findings, so I will do that now:
The total words per year increased a lost from 2008-2010, then leveled off. Meanwhile, as seen in the graph below, the number of unique words used is steadily increasing. I noticed lots of the words were nonsense exclamations; maybe people are coming up with weirder words as time goes on. Unfortunately, filtering these out in my program would prove difficult, so they count as real words here.
I did the first extension, which was to exclude common words when counting. To do this, I made an array of strings, added in those words, and if they were not included in the read line, I would increment the word count and add the word to the map as per usual. Here is my code for that:
Note that I had to search the array as if it were a list and check using .contains() to find out if a line had an excluded word.
I also added the ability for a person to type in a word into the command line and get that word's frequency to print, although I won't say that's an extension. I did do (see: attempt) the memory loss extension, which can be seen in my code.
In this lab, I learned how convenient for me yet inefficient for performance for loops/foreach loops are. I got used to using recursion again after not seeing it for a while. I also learned how to read and write files, which is extremely useful.
Thanks to: Stephanie, Melody, Bruce