Summary: The goal of this project was to create a data class that would hold data from a csv file both in its original format and in a numeric format that can be used for analysis. I used a list of lists to hold the strings of raw data and a numpy matrix to hold the floats of the numeric data for analysis. Once the structures for holding that data were complete, an analysis file was made to use the numeric data matrix and gather useful information about the data contained. I used a data set that holds information about college students to ensure that my data class and analysis functions were working properly, and the analysis was successful.
1. The first task was to create the Data class. The fields that I had in my data class summarize the information that I gathered and stored about the csv file:
self.filename: this stores the name of the file to be read
self.raw_headers: this takes a list of the strings in the first line of the csv file, which contains the headers of the columns in the data set
self.raw_types: this takes a list of the strings in the first line of the csv file, which contains the types of data contained in the column
self.raw_data: this is a list of lists of strings that corresponds to each row of the csv file (excluding the first two rows)
self.numeric_data: this is a list of lists of floats that corresponds to each row of the csv file, including only the columns with numeric data (and excluding the first two rows)
self.matrix_data: this is a numpy matrix that contains the columns chosen for analysis
self.header2raw: this is a dictionary that has all of the headers as keys and the corresponding column number as the values
self.header2numeric: this is a dictionary that has the headers of numeric data as keys and the corresponding column numbers as values
self.header2matrix: this is a dictionary that has the headers of the columns that are selected to be in the matrix for analysis as keys and the corresponding column number as the value
The read method takes in all the data and puts it into the appropriate fields. It starts by opening the file, and reading each row. It splits the first row by commas and puts it into raw_headers and splits the second and puts it into raw_types. Then for the rest of the lines, it splits the line by commas, and puts the lines into a list. Then it loops over the headers and puts each into a dictionary with its matching index. Then it loops over the headers again, checks if their corresponding type is numeric and if it is puts the header into the header2numeric dictionary with the corresponding index as the value and puts the header into the header2matrix dictionary with the count (the number of numeric values so far) as the value.
Then it makes self.numeric data by picking the columns of raw_data that have numeric data (the indices are picked by gathering the indices of the header2numeric dictioanry). Then matrix_data is made by turning numeric_data into a matrix and numpy functions to turn -9999.0 into numpy.nan, which it will ignore during analysis.
This is followed by a collection of accessor methods:
get_raw_headers: returns all headers
ger_raw_types: returns all types
get_raw_num_columns: returns total number of columns in the data set
get_num_rows: returns number of rows
get_raw_row(index): returns the index-th row
get_raw_value(i, j): returns the value in the ith row and jth column
print_raw_data: prints the raw data line by line
get_headers: gets only the numeric headers
get_num_columns: returns the number of numeric columns
get_row(index): returns the row with only numeric data
get_value(i,j): get the value in the ith row, jth column of the numeric_data array
Then there is one more important method called select. This selects data out of the matrix_data to be used for analysis. it takes in a list of headers, uses the header2matrix dictionary to find their index, and then returns self.matrix_data[ :, head ], which contains only the columns corresponding to the inputted headers.
I first tested these methods with the test file given in lab. A more robust test of all method with a larger data set will be given below.
2. The second task was to create an analysis.py file that uses data structures from the Data class to analyze the csv file contents. The first function finds the range of each column of data selected. It takes in the headers of the lines it should analyze and a data object. Then it uses the data.select(headers) method to create a matrix with the desired headers. It uses the numpy nanmin and nanmax methods to get a list of the mins and maxs, then loops over those two lists and combines them so that there is a list of lists of the min and max for each columns.
The second function finds the means of the selected columns. It uses the numpy nanmean method on the data.select(headers) matrix.
The third function similarly finds the standard deviation with numpy's nanstdev method.
The fourth function normalizes the individual columns of the selected matrix. It first uses data.select(headers) to get the desired matrix. Then it finds the nanmin along the 0 axis so it finds them for each column. Then it divides the matrix minus the list of mins, by the range (which is the max - min).
The fifth function normalizes the entire selected matrix. data.select(headers) get the desired columns of data, then nanmin is used to find the minimum of the entire matrix. the matrix minus the min is divided by the matrix max-matrix min.
I tested these methods with the test file that Bruce sent out. This contains a combination of numeric and other formats, and it has extra white space to ensure that the read method of the Data file can deal with those situations. Here is an image where first the raw data is printed, then the numeric data is printed (so it successfully reads only the numeric data in as floats, then it correctly selects two columns and successfully analyzes the data:
3. I tested the Data class and analysis functions using a data set of information about college students that I borrowed from my Statistics class. I cut it down to only 20 rows of data so that it is easier to verify the results of analysis.
Staring with the accessor functions in the Data class: this is an image of the terminal output. There is a print statement for each method accessor method for what it should print and what it does print. They all work perfectly:
Then I tested the analysis functions: I created a list of headers to be used for the analysis. I then have them print in order of how they appear in the csv file because this is how they results of the analysis will print.
Then I printed the result of the range function:
These match the range that Excel gives for the same columns:
Then I printed the results of the mean and standard deviation functions:
These match the mean and standard deviation according to Excel:
Then I printed the matrix normalized by column:
This matches the matrix that I made when I normalized it by hand:
And finally I printed the matrix normalized as a whole:
The values match when I normalized the matrix myself:
Thus, all of the methods and functions work properly.
1. The first extension that I completed was to allow the file to take in xls files as well as csv files. I wrote a function called xls_csv that uses the xlrd package. I used xlrd to open the xls workbook and the sheet. Then I made a csv file and made a csv writer. Then I looped over every row of the xls file and added it to the csv file:
for row in xrange(sheet.nrows): write.writerow(sheet.row_values(row))
for row in xrange(sheet.nrows):
Then in the data constructor I check if the filename ends in .xls, and if it does I call xls_csv on the file name and then call self.read on "convert_" + filename[:-4] + ".csv".
I saved the same csv file as an xls file and ran the program. Here is a picture of the result:
It's the same, so my function works!
2. The second extension that I completed combined extension options 2 and 3 and converts various date formats to numeric data. The one assumption that I made was that the dates would be in American format ie month/day/year. I wrote a helper date_conversion function to do this. It takes in a date string and first splits it by the "/". Then it checks if the last item in the list (the year section) if less than 3 characters long. If so, I need to finish the year ( for example, 68 into 1968). I decided that if the number is less than 18 than to add 20 in front and otherwise add 19.
Then I check if the year section is longer than 4 characters long. This means that the date includes a time stamp as well. I split this item of the list by the space in between the year and the time, and add the time to the end of the list. Then I make a string of the time that can be converted to time since epoch:
epoc_time = (date - datetime(1970, 1, 1,)).total_seconds()
Then I return the epoch time.
In the read method, when I loop over the raw_types looking for numeric variables, I add a second if statement to look for date variables. Then I add convert the data to the numeric version using date_conversion. The down side to doing it this way is that the numeric equivalent is put into the raw_type list rather than the date string. I made a csv file that contains dates and then printed the numeric data to see make sure it was added:
3. The third extension that I completed was to add the ability to add a column of data. I made an addColumn function that takes in a list (a column of data). It takes the first item in the list and appends it to raw_headers, the second and appends it to raw_types, and adds it to header2raw. If its type is numeric, it gets added to header2numeric and header2matrix. Then I made an array out of the remaining items of the lust, and hstack raw_data and the column array. If it is numeric, it is also hstack-ed to the self.matrix_data. I added an age column to the student survey data. This successfully worked because when I add the column and print the matrix data:
The 18-22 at the end is the age field that I added.
4. The fourth extension that I completed was to do some other simple data analysis. I added another function to the analysis file called mode. It returns a list of the mode and the frequency of that value for each column of the matrix. If there are more than one that tie for the mode, it returns the smallest. This function uses data.select(headers) to get the appropriate matrix, then uses stats.mode from the scipy package to return a tuple with the modes followed by the frequencies. I looped over this and paired the mode and frequency together for each. I ran this on the same csv file as before and got the following result:
This is accurate, based on the data provided.
What I learned: This project taught me a lot about taking advantage of numpy matrices and commands. I was able to get my normalize functions down from 20 lines to 3 by learning more about the capabilities of numpy. I look forward to learning more in the future.
Thanks to Prof. Bruce Maxwell and Prashant for assistance.