
For this project, we made a program that takes in a pre-made word count file, which lists each word and the number of times it appears in a text file, and reports the count of any given word. With the FindCommonWords class, the user can specify a file and the program will return the ten most frequently used words. To implement this, I used a priority queue heap, which is structured as a max heap. The root is always the biggest value. Only the root can be removed at any point; when it is, the last value takes its place, and the heap then shifts that value down until the largest remaining value is the root again. If something is added, it is put in the last spot, then shifted up for as long as it is bigger than its parent. This time the heap was implemented with an array instead of with nodes, as in most of the earlier projects. As expected, the most common words tended to be pronouns and forms of "to be". For another file, we created a way to find the frequency of user-provided words across all of the reddit comment files, and reported our results in a graph.
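To make the add and remove behavior described above concrete, here is a minimal sketch of an array-backed max heap. It is not the project's heap class: the name MaxHeapSketch, the method names, and the use of plain ints instead of key-value pairs with a comparator are all assumptions made to keep the example short.

```java
import java.util.Arrays;

/** Illustrative array-backed max heap of ints (hypothetical class, not the project's heap). */
class MaxHeapSketch {
    private int[] data = new int[8];
    private int size = 0;

    public int size() {
        return size;
    }

    /** Put the new value in the last spot, then shift it up while it beats its parent. */
    public void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2); // grow the backing array when full
        }
        data[size] = value;
        siftUp(size);
        size++;
    }

    /** Remove and return the root (the largest value). */
    public int remove() {
        if (size == 0) {
            throw new IllegalStateException("heap is empty");
        }
        int root = data[0];
        size--;
        data[0] = data[size]; // the last value takes the root's place
        siftDown(0);          // then sinks until the heap property holds again
        return root;
    }

    private void siftUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (data[i] <= data[parent]) {
                break;
            }
            swap(i, parent);
            i = parent;
        }
    }

    private void siftDown(int i) {
        while (true) {
            int left = 2 * i + 1;
            int right = left + 1;
            int largest = i;
            if (left < size && data[left] > data[largest]) {
                largest = left;
            }
            if (right < size && data[right] > data[largest]) {
                largest = right;
            }
            if (largest == i) {
                return;
            }
            swap(i, largest);
            i = largest;
        }
    }

    private void swap(int a, int b) {
        int tmp = data[a];
        data[a] = data[b];
        data[b] = tmp;
    }
}
```

With the array layout, the parent of index i sits at (i - 1) / 2 and its children at 2i + 1 and 2i + 2, so no node objects or pointers are needed.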

Task 1 was to make the FindCommonWords file. I had fields for the word count, the heap, and the comparator. This class has one primary method, which the main method calls after passing in the file name. It reads the first two lines to skip them, then, while reading the rest of the lines, increments the count, because each line holds a new word. Each line is split on white space, which separates the word from its count, and the two parts are added to the heap, with the word as the key and its count as the value. To find the top ten words, I called the heap's remove method ten times, printing the result each time. The results for 2008 are below. The word "I" is the most common, predictably.
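The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: it uses java.util.PriorityQueue with a reversed comparator in place of the project's own heap, it takes the lines as a list rather than reading a file, and it assumes each data line looks like a word followed by an integer count. The names TopWordsSketch and topWords are made up for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical helper: picks the n most frequent words from word-count lines. */
class TopWordsSketch {

    static List<String> topWords(List<String> lines, int headerLines, int n) {
        // Max heap ordered by count (largest count comes out first).
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        // Skip the header lines, then split every remaining line on white space.
        for (int i = headerLines; i < lines.size(); i++) {
            String[] parts = lines.get(i).trim().split("\\s+");
            if (parts.length < 2) {
                continue; // ignore blank or malformed lines
            }
            // Assumed layout: parts[0] is the word, parts[1] is its count.
            heap.add(new AbstractMap.SimpleEntry<>(parts[0], Integer.parseInt(parts[1])));
        }

        // Remove the root n times to get the n most common words.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < n && !heap.isEmpty(); i++) {
            Map.Entry<String, Integer> top = heap.poll();
            result.add(top.getKey() + " " + top.getValue());
        }
        return result;
    }
}
```

Calling topWords(lines, 2, 10) would mirror the description: two header lines skipped, ten removals from the heap.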

Task 2 was to make a FindTrends class. It takes several arguments, which are, in order: base file name, start year, end year, and the word(s) to be found, separated by spaces if there are multiple words. The fields I had were a String arraylist of the desired words, an arraylist of arraylists containing the frequency of each word in a map, an int start year and end year, and a String for the base file name. Other than the constructor and main, this file has one method plus a helper. The primary method is readFiles, which uses the base string and the years to build a String with the file name, so it knows what to open. Each file has a BSTMap made for it, along with an arraylist into which we put the frequencies of each word within the map. The helper method searches the provided map for the word I pass in and, if it finds the word, puts its frequency into the arraylist parameter. At the end of the readFiles method, I loop through the arraylist containing each arraylist, and print out the frequency for any given map, using the loop index to keep track of which word I am on. The way I formed each new file name and the map and frequency list I had for said file is pictured below:
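As a rough stand-in for that idea (a sketch, not the code from the project), the example below builds one file name per year and collects one list of values per file. The naming scheme of base name plus year plus ".txt" is an assumption, java.util.TreeMap stands in for the BSTMap class, raw counts stand in for frequencies, and the names TrendsSketch, fileNameFor, fileNames, and valuesFor are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of per-year file naming and per-file word lookup. */
class TrendsSketch {

    // Assumed naming scheme: base name, then the year, then ".txt".
    static String fileNameFor(String base, int year) {
        return base + year + ".txt";
    }

    // One file name for every year from start to end, inclusive.
    static List<String> fileNames(String base, int startYear, int endYear) {
        List<String> names = new ArrayList<>();
        for (int year = startYear; year <= endYear; year++) {
            names.add(fileNameFor(base, year));
        }
        return names;
    }

    // Look up each requested word in one file's map; 0 if the word is absent.
    static List<Integer> valuesFor(Map<String, Integer> counts, List<String> words) {
        List<Integer> values = new ArrayList<>();
        for (String word : words) {
            values.add(counts.getOrDefault(word, 0));
        }
        return values;
    }

    public static void main(String[] args) {
        // Tiny made-up map standing in for the map built from one year's file.
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("summer", 12);
        counts.put("winter", 3);

        List<String> words = List.of("summer", "winter", "autumn");
        System.out.println(fileNames("counts_", 2008, 2010));
        System.out.println(valuesFor(counts, words));
    }
}
```

Keeping one inner list per file means the outer list's index maps to a year and the inner list's index maps to a word, which matches the index-based printing described above.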

The way I split the args is below. Notice that I made a list of all words the user wanted investigated. Also, the start year and end year start off as strings here, but in the constructor, I turn them into integers.
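For readers without the screenshot, the argument handling can be approximated like this. It is a sketch under the argument order given earlier (base file name, start year, end year, then the words); the class name TrendArgsSketch and its field names are hypothetical, and unlike the description it converts the years to integers in the same place where the arguments are split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical holder for the FindTrends command-line arguments. */
class TrendArgsSketch {
    final String baseName;
    final int startYear;
    final int endYear;
    final List<String> words;

    TrendArgsSketch(String[] args) {
        if (args.length < 4) {
            throw new IllegalArgumentException(
                "usage: <base file name> <start year> <end year> <word> [more words]");
        }
        baseName = args[0];
        // The years arrive as strings and are turned into integers here.
        startYear = Integer.parseInt(args[1]);
        endYear = Integer.parseInt(args[2]);
        // Everything after the third argument is a word to investigate.
        words = new ArrayList<>(Arrays.asList(args).subList(3, args.length));
    }
}
```

Copying the sublist into a new ArrayList keeps the word list independent of the original args array.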

Task 2 also asked us to develop a line graph showing results of a certain word theme, revealing patterns in frequency over the years. I used the first provided list, which seemed to have a theme of "technology".


The word "friend", as expected, had a high frequency to start with, because it can be used in contexts outside of Facebook. But Facebook's growth in popularity presumably also led to a higher frequency for "friend" in the Facebook sense. This spike happened between 2009 and 2010. Meanwhile, the spike in popularity of "ipad", "portal", and "syntax" happened between 2010 and 2011. It is interesting to note that "ipad" and "syntax" were extremely similar in their frequency levels and growth. Perhaps people on reddit were talking about the syntax that iPads use, or the way they handle it, including correction. "Portal" saw a spike around 2011 as well, likely due to the release of the popular sequel to the video game "Portal". It came out in April of 2011, so, given that the comments are all from May, there would have been a lot of talk. Lastly, "sony" was not spoken about that much, and remained fairly constant throughout the years. In comparison to "portal", it did not have an explosion in popularity, but rather enjoyed a steady rate of use. This is likely due to the Sony user base being smaller but more devoted to their systems in general, in contrast to, say, Xbox, which casual gamers use much more often, leading to a larger user base. This is, at least, what I think from my experience as a Sony fan.

I did the first extension, which involved producing my own list of words and analyzing their frequencies. I made a graph, with the theme of seasons in mind:

The first thing I noticed was that not many people speak about winter. Given that the comments are from May, the most spoken-about season is summer, because people are likely waiting for it to come and for their school year to finish. "Summer" seems to peak in 2011, and "winter" has a slight correlation with its rate, perhaps indicating that people speak about summer and winter in comparison. Rising at roughly the same time as "summer" and falling about as quickly is "outside", which I interpreted as people talking about what they will do outside in the summer. Meanwhile, spring is spoken about the least, even though the comments are from springtime. And last of all, almost no one speaks about "autumn". I was going to use the word "fall", but I realized that another meaning of the word could get mixed in and decided not to. Use of the word "autumn", however, seems to stay about constant, as do "winter" and "spring" (in comparison to "summer"). The spike in popularity of "summer" may be due to people waiting for summer to come after a harsh winter. Looking into it, it seems like 2009-2010 was the roughest winter for the US, while 2011-2012 was not very bad. This may explain the rising popularity of "summer" during 2010 and the falling popularity during 2012.

I also did the data loss extension, which can be seen in my files.

This project was, overall, not as intensive as some other projects, so it flowed for me as I went along. I suppose that the most challenging part of the project was probably figuring out how to deal with multiple files and report results based on all of them. It was good practice in manipulating strings, which I did to get the desired file name. I also enjoyed figuring out how to store the frequency data of each file, which I ended up doing with an arraylist of arraylists, and printing all of the data out in the end. Also, I had a lot of practice with specifying file locations, as well as with dissecting the command-line input.

Thanks to: Melody, Stephanie
