Skip to end of metadata
Go to start of metadata

Introduction:

The main data structure for this week's project was Hash Table (HT), which implements MapSet interface and stores KeyValuePairs like Binary Search Tree (BST) from last week.  There are several ways to implement HT: 1) open or 2) closed.  If open, array-based HT would have another data structure, such as BST and ArrayList, at individual indexes of the array.  If closed, the array would simply store KeyValuePairs at the indexes.  Another major difference between open and closed HT is their approaches to deal with "collision".  Collision refers to a situation when a given key is hashed to an index where there is already a KeyValuePair.  Open HT would simply add this pair with same hash value to whichever data structure exists at that index.  On the other hand, closed HT would simply insert this key with same hash value to the next available index.  Thus, closed HT performs linear search to deal with collision, which can slow down the process.  I will discuss later how I dealt with this possible issue of closed HT.  To be concise, the major task of this project was to observe which data structure, between BST and HT, was more efficient in regard to processing a relatively large amount of data.  

Main Code:

Hashmap{} 

Hashmap class implements closed hash-map, which means there is no additional data structure within the map's array.  One thing to note is that there is an additional instance variable of array, hasBeenUsed.  This array is used to keep track of which index is not null.  It will be used in getIndex() method.

hash()

For hash() method, I simply use java's hashCode() method, mod operator by the length of array, and absolute function to ensure that it returns an appropriate value.

getIndex()

This method uses while loop to search for the given key in the array.  This is when hasBeenUsed array is used.  If it fails to find the index of the given key, it returns a negative value.

ensureCapacity()

I enabled ensureCapacity() to be called whenever the array is half-full, instead of completely full.  This is how I deal with the issue of linear probing in closed HT.  When the array becomes half-full, I save the original data first, double the size of array, and copy over the original data back to the array.

put()

The put() method uses getIndex() method to see if the given key is already in the map.  If so, it simply updates the value of that key.  If not, it then looks for a next available index to insert this new key.  Again, I use while loop to do this.  One thing to note is that I increment the number of collisions by one only in the while loop.  The reason is that only the first insertion of a given key can be a collision.

get()

The get() method uses getIndex() method to find the index of the given key.  If the returned index is negative, it means the key does not exist in the array.  If not, it finds the pair at that particular index and returns the pair's value.

Hashtree{} (this is my extension that implements one more collision handle method)

Hashtree class implements open hash-map, which uses BST at each index of the array to store data.  However, it is still an array-based HT.  Because it uses BST, it takes in a comparator as one of its instance variable.

put()

Hashtree's put() method is slightly different from Hashmap class.  It does not need to use getIndex() method because it doesn't have to check if the key is already in the array.  BST's put() method updates the value automatically if the key already exists.  If not, it instantiates a new BST at that index and puts the (key, value) pair.  

WordCounter{} 

In order to enable WordCounter class to take in different type of data structures, I needed to change the instance variable to MapSet<String, Integer> type.  I also created a public enumeration Type for the methods below.  Based on the instance variable of enumeration Type, the process() and automate() return different kinds of data, such as depth of tree or number of collisions.  The type is controlled by the command line argument in the main method of WordCounter.java.  When running the main method WordCounter.java, please type in BST, HASHMAP, or HASHTREE to implement different data structures for processing the text files.

process()

I created this method to simplify the process of calculating the time taken, total number of words, number of unique words, depth of tree, and number of collisions.  At the end, the method returns an array of five values in total.(the last value is the memory used from analyze() method; this is extension.)

automate()

I used for loop to execute process() method five times, record the data to calculate averages, and return the outputs as an ArrayList.  I used Collections packages' max and min() methods to discard the high and low record times.

Automate() return the results in following order: 1) Time record, 2) The total number of words, 3) The number of unique words, 4) Depth of tree/number of collisions, and 5) memory usage.

Graphs

For all of my graphs, but one, blue represents BST while red represents closed hash-map.

By just looking at the relationship between year of text file and each run time, the trend is not clear.  

 

 

The relationship between the total number of words and the run time is quite clear; the more words there are in the file, the longer it takes to process them.

 

The relationship between the number of unique words and the run-time seems to follow the trend above.

 

Both BSTMap and Hash-table seem to perform ideally using total words versus run-time, rather than file size versus run-time.  There seems to be no associate between file size and run-time.  As file size increases, run time does not always increase.

Although the run-time tends increase as the number of collisions increases, the relationship is not as strong as that between total number of words and run-time.  In fact, the relationship is not linear.  

Regarding the BSTMap, I don't see any linear relationship between the depth of BST and its performance.


Extensions: 

  • Try implementing more than one collision handling method. For example, (1) use a linked list instead of a BSTMap at each table entry, or (2) use a closed hash table (no extra data structures). Compare performance.

Because I used a closed hash-map for the main tasks, I implemented an open hash-map based on BST.  Below are some differences between the performances of open and closed hash-maps.  Please refer to Hashtree.java for my implementation of BST based open hash-map.

Blue: closed hash-map ; Red: open hash-map

It is quite obvious that open hash-map has far fewer number of collisions.  

However, the closed hash-map has slightly faster run-time than the open hash-map.

Blue: closed hash-map ; Red: open hash-map

 

  • Examine the space efficiency of your implementations. One way to do this would be to refrain from using the -Xmx512m flag. Count words of files of increasing sizes. See what is the size of the smallest size that crashes your program. Does a smaller file crash the Hashtable code or the BSTMap code? Why?

I approached this extension slightly different from the instruction.  I measured the memory usage of each analyze() method and recorded it through process() and automate() methods.  To calculate the memory usage, I used Runtime package's totalMemory() and freeMemory() methods.  The process of calculation is similar to  that of calculating the time taken.

I then compared the memory used by BST and closed hash-map.  Note that the unit of memory in the graphs is MB although the memory usage returned from automate() method is in KB.

It is clear that the closed hash-map used less memory than BST did.  This is reasonable result because BST requires more memory usage due to its nodes while closed hash-map simply stores data in the array.

 

  • Improve the time-efficiency of one of your data structures. Explain what you did and report the improvement in speed

I decided to improve the time efficiency of my closed hash-map and modified my ensureCapacity() method in Hashmap.java.  Initially, my ensureCapacity() doubled the size of the array when it became full.  Below is the comparison of run-time of closed hash-map with different ensureCapacity() methods.  The blue line demonstrates the run time when the size of array is doubled only if it is completely full.  The red line shows the run time when the size of array is doubled if it is 50% full.  The difference is not trivial.

If I compare the run-time graph in the left with that in the right, I can see ensureCapacity() method can influence the time-efficiency.  The more frequently I call ensureCapacity() method, the faster the run-time is.   Below is where I make a change in ensureCapacity() method.  I would divide data.length by 2 in order to double the size of array when it is half full.

  • Extra Graphs

Because I recorded the memory usage for each data structure, I wondered if there is an association between the number of unique words and the memory usage. 

Both graphs (BST & Hash-Map) demonstrate that as the number of unique words increases, the memory usage increases as well.  These are reasonable outputs because the more there are unique words, the more nodes or space in array is needed to fit the new (key, value) pairs.

Conclusions:

First of all, I learned how to implement hash-map, using both open and closed structure.  More importantly, I was able to observe time/space efficiency of different data structures.  It was fascinating to confirm in practice that closed hash-map, which is array-based, used less memory than BST.  

For this project, I received help from professor Maxwell and Nhi.

Labels