The main goal of this project was to read in CSV files whose columns are labeled by headers, and then to categorize and manipulate the data. A file called data.py read in the file and stored its contents in lists, dictionaries, and matrices, and we wrote accessor methods to retrieve that information. The analysis file had a few methods, such as one to calculate the mean of one or more matrix columns and one to normalize each column separately. The end result was a set of summary information printed in the terminal that describes the data in a given file.
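As a rough illustration of the two analysis methods mentioned above, here is a minimal sketch. The function names and the sample matrix are hypothetical, and min-max scaling to [0, 1] is only an assumption about what "normalize" means here; the project's actual code may differ.

```python
import numpy as np

def column_means(matrix, cols):
    """Return the mean of each selected column (hypothetical helper)."""
    return matrix[:, cols].mean(axis=0)

def normalize_columns_separately(matrix, cols):
    """Min-max scale each selected column to [0, 1] on its own range.

    Assumption: min-max scaling; the write-up does not say which
    normalization the project used.
    """
    sub = matrix[:, cols].astype(float)
    mins = sub.min(axis=0)
    ranges = sub.max(axis=0) - mins
    ranges[ranges == 0] = 1.0  # constant columns: avoid dividing by zero
    return (sub - mins) / ranges

# Made-up numbers, purely for illustration.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])

means = column_means(data, [0, 1])
normed = normalize_columns_separately(data, [0, 1])
print(means)
print(normed)
```

Because each column is scaled by its own minimum and range, a column with large values does not squash a column with small ones.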
*Work in progress, obviously*
Task 1 was to create the algorithm for reading the file in, and to then make the accessor methods.
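A reader along these lines might look roughly like the sketch below. The class name, attribute names, and accessor names are all hypothetical (they are not taken from the actual data.py), and it reads from an in-memory string so that it runs without a file on disk.

```python
import csv
import io

class Data:
    """Hypothetical container: headers in a list, a header-to-column
    dictionary, and the raw rows as a list of lists."""

    def __init__(self, stream):
        reader = csv.reader(stream)
        # The first line holds the headers; strip stray whitespace.
        self.headers = [h.strip() for h in next(reader)]
        # Map each header name to its column index for quick lookup.
        self.header2col = {h: i for i, h in enumerate(self.headers)}
        # Keep every non-empty data row, with cells stripped.
        self.rows = [[cell.strip() for cell in row] for row in reader if row]

    def get_headers(self):
        return list(self.headers)

    def get_num_rows(self):
        return len(self.rows)

    def get_value(self, row_index, header):
        return self.rows[row_index][self.header2col[header]]

# Made-up sample input standing in for a CSV file.
sample = io.StringIO("name, age\nAda, 36\nGrace, 45\n")
d = Data(sample)
print(d.get_headers(), d.get_num_rows(), d.get_value(1, "age"))
```

With a real file, the same class could be given an open file handle, for example `Data(open(filename, newline=""))`.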
The last task was to test everything on a data set. For that, I found one online about the race and gender of babies born to women in two different age categories. I made sure everything worked, and have provided the file I used. At one point, I noticed that my standard deviation differed from Excel's, and I found that I needed to specify one delta degree of freedom (ddof=1) in the numpy standard deviation function, so that it divides by N - 1 instead of N. As the picture below shows, my results check out with Excel's for mean and std. dev.
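The standard deviation discrepancy can be reproduced with a small sketch. The values here are made up for illustration and are not from the baby data set; `np.std` defaults to `ddof=0` (population formula), while `ddof=1` gives the sample formula.

```python
import numpy as np

# Made-up values, not taken from the project's data set.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(values)       # default ddof=0: divide by N
sample_std = np.std(values, ddof=1)   # ddof=1: divide by N - 1

print(population_std, sample_std)
```

The two results differ noticeably for small N and converge as N grows, which is why the mismatch is easy to miss on large data sets.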
For my extensions, I did one that treats dates as numeric data, and one that reads more date formats.
The second extension was just a matter of checking whether the line contained periods instead of slashes, and then checking whether the third field (the year) was only 2 digits long. If it was, I added "20" to the front, because we should assume the 21st century if it is unspecified, and it is convenient to have all years given as four digits in the end.
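The date handling described above could be sketched as follows. The function name is hypothetical, and the sketch assumes a date has exactly three parts with the year last; it does not check which of the first two parts is the month or the day.

```python
def normalize_date(text):
    """Return a slash-separated date with a four-digit year.

    Hypothetical helper: accepts periods or slashes as separators and
    assumes the year is the third part.
    """
    parts = text.strip().replace(".", "/").split("/")
    if len(parts) != 3:
        raise ValueError("expected three date parts, got: " + repr(text))
    first, second, year = parts
    if len(year) == 2:
        # Assume the 21st century when the century is unspecified.
        year = "20" + year
    return "/".join([first, second, year])

print(normalize_date("3.14.15"))
print(normalize_date("12/25/2016"))
```

Converting periods to slashes first means the rest of the date code only has to handle one separator.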
In this project I learned about some file-reading strategies and various numpy methods and usage strategies, and got good practice keeping track of information throughout a loop with strategically placed variables.