For this project, the main goal was to read in CSV files whose columns are labeled by headers, and to categorize and manipulate that data. We had a file called data.py that read in the file and stored its contents in lists, dictionaries, and matrices. Then we wrote accessor methods to get that information back out. In the analysis file, we had a few methods, such as one to calculate the mean of one or more matrix columns, and one to normalize the columns separately. The end result was a set of summaries printed to the terminal that describe the data in a given file.
Task 1 was to create the algorithm for reading the file in, and to then make the accessor methods.
I read in the file by using the csv package's reader. I split the text on commas and made sure to ignore any comments in the CSV files. Then I looped over all of the lines (which we will call rows). The first row holds the headers and the second holds the types, so I checked for those two indices first in the loop. I used a variable to keep track of which column I was on during various parts of the loop, which was useful when building dictionaries. Anything that reaches the final else branch is a regular data row, and there I also needed to check whether each value was numeric. I did this by trying to convert the string to a float and catching the exception: if the conversion succeeds, the value is numeric; otherwise, it isn't. After doing all this, I made the accessor methods, which were mostly a matter of returning lists that we'd already built, or else manipulating them a bit. The one that gave me trouble was get_data, as I was trying to fetch the columns separately and then combine those matrices. In the end I got help from Stephanie, who told me to first make a zero matrix the size of the return matrix and then fill it in. This was much easier. Incidentally, here is a snippet of the start of my reading loop:
And this was my logic for testing whether or not something should be a numeric type:
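Note that "rU" here opens the file for reading in universal-newlines mode, which prevents errors when reading CSV files that use different line endings. (In Python 3 the "U" flag is deprecated and was removed in 3.11, since universal newlines are already the default for text mode.)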
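The reading loop and numeric check described above can be sketched roughly as follows. This is a minimal illustration rather than the original data.py code: the names read_rows and is_numeric, the "#" comment marker, and the sample data are all assumptions.

```python
import csv
import io


def is_numeric(text):
    """Return True if the string can be converted to a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def read_rows(lines):
    """Split CSV lines into headers, types, and data rows.

    Row 0 is treated as the headers, row 1 as the types, and every
    later row as data. Lines starting with '#' are assumed to be
    comments (hypothetical convention) and are skipped.
    """
    headers, types, data = [], [], []
    row_index = 0  # counts only non-comment, non-blank rows
    for row in csv.reader(lines):
        if not row or row[0].strip().startswith("#"):
            continue
        cells = [cell.strip() for cell in row]
        if row_index == 0:
            headers = cells
        elif row_index == 1:
            types = cells
        else:
            data.append(cells)
        row_index += 1
    return headers, types, data


# Hypothetical sample input standing in for a real file.
sample = io.StringIO("# a comment\nname,age\nstring,numeric\nAda,36\nBob,n/a\n")
headers, types, data = read_rows(sample)
print(headers, types, data)
print(is_numeric("36"), is_numeric("n/a"))
```

Counting rows with a separate row_index, instead of the reader's own line count, keeps the header and type checks correct even when comment lines come first.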
For task 2, we needed to create an analysis.py file, which was just a file full of functions. We had one to get the range of the data (i.e. the min and max) of the provided columns, one to calculate their means, one for their standard deviations, and two to normalize the columns, both individually and relative to each other as a whole matrix. Numpy has built-in methods for getting the mean, min, max, and standard deviation (although for the last one I did need to specify ddof=1, i.e. one delta degree of freedom, to get the sample standard deviation), so the first three functions were just a matter of looking up the headers in our dictionary, using that to get the correct matrix columns, and applying the operation to them.
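As a rough sketch of those first three functions, assuming a plain numpy array and a header-to-column dictionary (the names header2col, select_columns, data_range, mean, and stdev and the sample values are hypothetical, not the actual analysis.py):

```python
import numpy as np


def select_columns(matrix, header2col, headers):
    """Pick the matrix columns that correspond to the given headers."""
    return matrix[:, [header2col[h] for h in headers]]


def data_range(matrix, header2col, headers):
    """Return a (min, max) pair for each requested column."""
    cols = select_columns(matrix, header2col, headers)
    return list(zip(cols.min(axis=0).tolist(), cols.max(axis=0).tolist()))


def mean(matrix, header2col, headers):
    """Return the mean of each requested column."""
    return select_columns(matrix, header2col, headers).mean(axis=0)


def stdev(matrix, header2col, headers):
    """Return the sample standard deviation (ddof=1) of each column."""
    return select_columns(matrix, header2col, headers).std(axis=0, ddof=1)


# Hypothetical data: two numeric columns, three rows.
header2col = {"a": 0, "b": 1}
matrix = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

print(data_range(matrix, header2col, ["a", "b"]))
print(mean(matrix, header2col, ["a", "b"]))
print(stdev(matrix, header2col, ["a", "b"]))
```

With the default ddof=0, numpy divides by N and returns the population standard deviation; ddof=1 divides by N - 1, which is the sample standard deviation.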
Normalizing the columns took a different approach. For normalizing each column separately, I first made a zero matrix, as in the get_data method from before. Then I looped over all of the columns and got each one's max and min and the range between the two. Next, I used a normalization formula I found online, which is high - (((high - low) * (maxs - currCol)) / minMaxRange), where high is 1 and low is 0, because we want everything to range from 0 to 1, and minMaxRange is the difference between the given column's min and max. The function then returned the new matrix. For normalizing all columns in relation to one another, I simply made a matrix out of the given columns, then made a new matrix using the same formula, but without looping over the columns, instead taking the max, min, and range from the matrix as a whole. To run all of these, the data class had this method:
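Separately from that method, the two normalizations described above can be sketched as follows. This is a minimal illustration of the formula, not the original code: the function names are hypothetical, plain numpy arrays are assumed, and every column (or the whole matrix) is assumed to have a nonzero range.

```python
import numpy as np

HIGH, LOW = 1.0, 0.0  # target range for the normalized values


def normalize_columns_separately(data):
    """Scale each column to [LOW, HIGH] using its own min and max."""
    data = np.asarray(data, dtype=float)
    result = np.zeros(data.shape)  # zero matrix to fill in column by column
    for j in range(data.shape[1]):
        col = data[:, j]
        col_max, col_min = col.max(), col.min()
        min_max_range = col_max - col_min  # assumed nonzero
        result[:, j] = HIGH - ((HIGH - LOW) * (col_max - col)) / min_max_range
    return result


def normalize_columns_together(data):
    """Scale the whole matrix to [LOW, HIGH] using one global min and max."""
    data = np.asarray(data, dtype=float)
    overall_max, overall_min = data.max(), data.min()
    min_max_range = overall_max - overall_min  # assumed nonzero
    return HIGH - ((HIGH - LOW) * (overall_max - data)) / min_max_range


# Hypothetical data: two columns on very different scales.
sample = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(normalize_columns_separately(sample))
print(normalize_columns_together(sample))
```

With high = 1 and low = 0 the formula reduces to (x - min) / (max - min), so the separate version maps every column onto the full 0 to 1 range, while the together version preserves the relative scale between columns.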
The last task was to test everything on a data set. For that, I found one online on the race and gender of babies born to women in two different age categories. I made sure it worked, and have provided the file I used. At one point, I noticed that my standard deviation differed from Excel's, and I found that I needed to specify one degree of freedom (ddof=1) in the numpy standard deviation function. As the picture below shows, my results match Excel's for both the mean and the standard deviation:
As for my extensions, I did one involving treating dates as numeric type data, and one for reading more formats of dates.
First, within the file-reading loop, I checked whether the given piece of data split into three parts on slashes. Then I checked whether the first part could be converted to a float, i.e. whether it was a number. If so, I converted the date information to a datetime object and calculated the time elapsed since the epoch, which I stored in a field. In theory, this gives every date a unique number, so dates can be treated like any other numeric type.
The second extension was just a matter of checking whether the date was written with periods instead of slashes, and then checking whether the third field (the year) was only two digits long. If it was, I added "20" to the front, because we should assume the 21st century if it is unspecified, and it is convenient to have all years given as four digits in the end.
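Both extensions can be sketched together as below. This is a hedged illustration, not the original code: the function name date_to_number, the month/day/year field order, and the choice of January 1, 1970 as the epoch are all assumptions.

```python
from datetime import datetime

# Assumed epoch; the original only says a field held "the epoch".
EPOCH = datetime(1970, 1, 1)


def date_to_number(text):
    """Return seconds since EPOCH for 'M/D/Y' or 'M.D.Y' text.

    Returns None when the text does not look like a date, so the
    caller can fall back to treating it as a plain string or float.
    """
    parts = text.strip().replace(".", "/").split("/")
    if len(parts) != 3:
        return None
    month, day, year = parts  # assumed field order
    if len(year) == 2:
        year = "20" + year  # assume the 21st century for two-digit years
    try:
        parsed = datetime(int(year), int(month), int(day))
    except ValueError:
        return None  # non-numeric parts or an impossible date
    return (parsed - EPOCH).total_seconds()


print(date_to_number("1/2/1970"))
print(date_to_number("1.1.00"))
print(date_to_number("3.14"))
```

Because a plain decimal such as "3.14" has only two period-separated parts, it is not mistaken for a date and can still go through the usual float check.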
The following pictures show dates working and the epoch time matching, respectively.
In this project I learned about some file-reading strategies and various numpy methods and how to use them, and I got good practice keeping track of information throughout a loop with strategically placed variables.
Thanks to: Melody, Stephanie