...


Note that "rU" here stands for "read universal" (universal newlines mode) and prevents errors when reading CSV files that may use different line endings.
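The "U" flag was removed from open() in Python 3.11, so here is a minimal sketch of how the same idea could look in Python 3, assuming the csv module is used for parsing. The function name read_csv_rows and the sample contents are made up for illustration and are not part of the project's code.

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Read every row of a CSV file as a list of strings.

    newline="" leaves line-ending handling to the csv module, which
    accepts \n, \r, and \r\n endings, even mixed in one file.
    """
    with open(path, newline="") as fp:
        return list(csv.reader(fp))


# Write a small file with three different line endings (hypothetical data).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"name,age\r\nann,30\rbob,25\n")

rows = read_csv_rows(tmp.name)
os.remove(tmp.name)
print(rows)
```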
For task 2, we needed to create an analysis.py file, which was just a file full of methods. We had a method to get the range of the data (i.e., the min and max) of the provided columns, a method to calculate their means, one for their standard deviations, and methods to normalize the columns both individually and relative to each other in a matrix fashion. Numpy has built-in methods for getting the mean, max, and standard deviation (although for the last one, I did need to specify 1 degree of freedom), so the first three functions were just a matter of getting the headers' info from our dictionary, using that to get the correct matrix columns, and applying the operation to them.
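A rough sketch of those first three functions might look like the following. It assumes a dictionary that maps header names to column indices and a plain 2-D numpy array; the names header2col, data, select, data_range, mean, and stdev, as well as the sample values, are placeholders rather than the names used in the actual project.

```python
import numpy as np

# Hypothetical stand-ins for what the data class stores.
header2col = {"height": 0, "weight": 1}
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])


def select(headers):
    """Return only the columns named in headers, in the given order."""
    cols = [header2col[h] for h in headers]
    return data[:, cols]


def data_range(headers):
    """Return one [min, max] row per requested column."""
    block = select(headers)
    return np.column_stack((block.min(axis=0), block.max(axis=0)))


def mean(headers):
    """Return the mean of each requested column."""
    return select(headers).mean(axis=0)


def stdev(headers):
    """Return the sample standard deviation (ddof=1) of each column."""
    return select(headers).std(axis=0, ddof=1)


print(data_range(["height", "weight"]))
print(mean(["height", "weight"]))
print(stdev(["height", "weight"]))
```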
Normalizing the columns, on the other hand, took a different approach. For the version that normalizes each column on its own, I first made a matrix of zeros, similar to the get_data method from before. Then, I looped over all of the columns and got each one's max and min and the range between the two. Next, I used a normalization formula I found online, which is high - (((high - low) * (maxs - currCol)) / minMaxRange), where high is 1 and low is 0, because we want everything to range from 0 to 1, and minMaxRange is the difference between the given column's min and max. With those values, the formula reduces to (currCol - min) / (max - min). The function then returned the new matrix. For normalizing all columns in relation to one another, I simply made a matrix out of the given columns, then made a new matrix using the same formula, but without looping over the columns, instead taking the max, min, etc. from the matrix as a whole. To run all of these, the data class had this method:
[Image: the data class method that runs the analysis functions]
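The two normalization functions described above could be sketched roughly as follows. This is an illustration of the formula, not the project's actual code: the function names and the use of a plain numpy array instead of a numpy matrix are assumptions.

```python
import numpy as np

HIGH, LOW = 1.0, 0.0  # target range: everything ends up between 0 and 1


def normalize_separately(block):
    """Scale each column to [0, 1] using that column's own min and max."""
    block = np.asarray(block, dtype=float)
    result = np.zeros(block.shape)
    for j in range(block.shape[1]):
        col = block[:, j]
        col_max, col_min = col.max(), col.min()
        min_max_range = col_max - col_min
        result[:, j] = HIGH - ((HIGH - LOW) * (col_max - col)) / min_max_range
    return result


def normalize_together(block):
    """Scale all columns to [0, 1] using the min and max of the whole block."""
    block = np.asarray(block, dtype=float)
    overall_max, overall_min = block.max(), block.min()
    min_max_range = overall_max - overall_min
    return HIGH - ((HIGH - LOW) * (overall_max - block)) / min_max_range


sample = [[1, 10], [2, 20], [3, 60]]
print(normalize_separately(sample))
print(normalize_together(sample))
```

Note that a column whose values are all identical has a range of 0, so a real implementation would need to decide how to handle that division.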

The last task was to test everything on a data set. For that, I found one online regarding the race and gender of babies born to women in two different age categories. I made sure everything worked on it, and have provided the file I used. At one point, I noticed that my standard deviation differed from Excel's, and I found that I needed to specify one degree of freedom in the numpy function for getting the standard deviation. As the picture below shows, my results match Excel's for the mean and standard deviation:
[Image: my mean and standard deviation results compared with Excel's]
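The discrepancy comes from numpy defaulting to the population standard deviation (dividing by n), while Excel's STDEV function computes the sample standard deviation (dividing by n - 1). A small sketch with made-up numbers, not the baby data set, shows the difference:

```python
import numpy as np

# Made-up values chosen so the population result is a round number.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = values.std()      # default ddof=0: divide by n
sample_std = values.std(ddof=1)    # ddof=1: divide by n - 1

print(population_std)
print(sample_std)
```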

As for my extensions, I did one that treats dates as numeric data, and one that reads more date formats.

...

The second extension was just a matter of checking if the line contained periods instead of slashes, and then checking if the third component (the year) was only 2 digits long. If it was, I would add "20" to the front, because we should assume the 21st century if it is unspecified, and it is convenient to have all years given as four digits in the end.
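As a rough sketch of that logic, assuming dates arrive as three components with the year last and are stored with slashes, something like this would work. The function name normalize_date and the choice to return a string are assumptions for illustration, not the project's actual code.

```python
def normalize_date(text):
    """Return a date as component/component/yyyy with slash separators.

    Accepts periods or slashes as separators and assumes the third
    component is the year. Two-digit years are taken to be 20xx.
    """
    parts = text.strip().replace(".", "/").split("/")
    if len(parts) != 3:
        raise ValueError("expected three date components: %r" % text)
    first, second, year = parts
    if len(year) == 2:
        year = "20" + year  # assume 21st century when unspecified
    return "/".join((first, second, year))


print(normalize_date("3.14.15"))
print(normalize_date("12/25/2016"))
```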

[Image]

In this project, I learned about some file-reading strategies and various numpy methods and usage strategies, and I got good practice keeping track of information throughout a loop with strategically placed variables.

...