The main data structure for this week's project was Binary Search Tree (BST), which implements MapSet interface. BST stores data as (key, value) pairs and inserts data in the order of keys. Here is how inserting a data into BST works. Suppose the type of key is string and the type of value is integer. The keys inserted after the first key, which is the root, will be placed either left or right to the root depending on their alphabetical order to the root. Therefore, the keys placed in the left of the root would be alphabetically smaller than the root while those in the right would be bigger than the root. This order of (key, value) pairs is why BST was used for this project. Given a large text file, I may want to analyze its certain aspects such as the total number of words in the file. If I were to store the information of the file in an arraylist or linked list, I would have to traverse the entire list to find a certain value given the corresponding key. In other words, the data structures we have used so far would be inefficient for this type of project. On the other hand, BST would be able to quickly find the value, given its key, because it simply has to search through the tree using recursion. "Recursion" is another important topic of this week's project because it was essential for implementing BST. To be concise, recursion was used by testing whether left or right subtree is null and deciding whether to continue to move down the tree. Once I implemented my own BST, I built another class WordCounter to utilize BST and obtain certain data about the text files, such as its total number of words and the frequency of certain words in the file. At the end of the project, I was able to obtain specific data of 8 different text files and compare the data between them.
: This method is similar to the read() method from project 4's Board class. First, it reads in the text file using BufferedReader class. It then uses while loop to loop through the lines in the text file and conditional statements to skip certain conditions such word has a length of zero. The difficult part of this method was to figure out how to update BST map appropriately. I used BST's get() method to see if a key's value was null in order to test whether the key is already in the map. If the key is already there, I simply incremented the value of that key.
: Based on the BST map that was built by using the analyze() method on the text file, this method writes a word count file, which summarizes the unique words, i.e. keys, and their frequencies, i.e. values. I first created a new file using FileWriter class and looped through an arrayList of BST's (key, value) pairs. To do so, I used BST's entrySet() method which returns an array list of (key, value) pairs in a pre-order traversal. Therefore, the word count file is written in pre-order traversal of BST. Below is an image of word count file from counttest.txt.
: This method is slightly different from analyze() method because it reads in word count file, which is already processed by analyze() method. It still reads in a text file using BufferedReader class and uses while loop to loop through the lines. However, it skips the first line because the first line of word count file will always contain information about total word count. With the rest of the lines, which only has two words: key and value respectively, it uses the information of lines to put an appropriate (key, value) pair in the BST map. In order to read in the first line separately, I used equals() method.
I created this test class in order to test if my write and read WordCountFile function properly. After I ran this file, I used diff command on the command line to determine if counts_ct.txt and counts_ct_v2.txt are identical. Below is the output of running diff on the command line. (if there is no difference, there should be no output)
After I confirmed that my write and readWordCountFile() methods worked properly, I analyzed eight different text files and obtained below data.
The last graph clearly fits my expectations. As the total word count increases, the processing time(which is in seconds) increases as well.
Write a toString method that prints the tree level by level from top to bottom.
I added toString() method to BSTMap class and its TNode class to print tree level by level from top to bottom in pre-order. In order to call recursion to root and its left and right nodes, I initialized a string variable inside BSTMap's toString() method. Then, I passed it in to the recursive call as a parameter. In order to increment the tab as the recursion goes on, I used for loop inside TNode class's toString() method.
Below is the image of calling toString() method in BSTMap.java. The output is very similar with the example one on the project's instruction page.
Have your WordCounter main function finish by printing out the 20 (or N) most frequent words in the document.
For this extension, I added a new method mostFrequent() in WordCounter.java. It returns an arrayList of keys that have the n most frequent words in the file. N is determined by the int parameter of the method. The mechanism is that I first create an arrayList of every key's frequencies using getCount() method of WordCounter class. Then, I use Collection class's sort and reverse method to make frequency list into an ascending order. Using another list, I save the first twenty values of the frequency list. Lastly, I loop through the key list of the BST and tests whether the newly created list contains the key's value. If there is, I add that particular key to the arrayList of keys. At the end, I simply return that arrayList.
Below is the output getting 20 most frequent words of reddit_comments_2008.txt. Please uncomment the last two lines in WordCounter's main method to observe the output below.
Simple User control
In order to enable user to relatively easily obtain time processing and unique word counts, I created a method process() in WordCounter.java. It can return processing time of each text file or unique word counts depending on the parameter. The parameter is determined through command line. Please type 0 to obtain processing time and any other integer to obtain unique word counts of each text file.
For this project, I received help from Professor Layton and Maxwell, TAs, and Qingbo.