# CS441 Systems Biology II: Describing a Data Set

### Purpose

This project is an introduction to examining a data set. I also wrote a small program to do some simple analysis and visualization.

### Questions

1. For this project, I found data from 2005 about population density in Cuba, Haiti, the Dominican Republic, and Puerto Rico (file: data.txt).
2. The data is from the LandScan™ Global Population Database, maintained at Oak Ridge National Laboratory in Oak Ridge, TN.
3. Every data point has values for its longitude and latitude, and a value for the population density of the region. Additionally, each has a 'cell ID', which is based on the latitude and longitude values. Since the ID supplies no additional information, I will disregard it and consider the data to be three-dimensional.
4. My data set contains 175 data points.
5. My data is in a comma-separated format.
6. Location values are exact with respect to their intended values, since the cells they refer to are centered every half degree, and all of the values end in either .25 or .75, referring to the upper left corner of these cells.
Population density values are given to 8 decimal places. This is very probably a limit of the file format rather than of the method, since the values were found using what sounds like a very complex algorithm that would have produced values with many more decimal places.
7. This data set does have missing data. Missing density information is represented by -9999.
8. The ranges are:
   - longitude: -84.75 to -64.75 (negative indicates the Western Hemisphere)
   - latitude: 17.75 to 23.75
   - population density: 0 to 5922.6047316
9. The latitude and longitude do not have error. The values for population density were found using probability coefficients that considered road proximity, slope, land cover, and nighttime lights. More information on these probability coefficients would supply appropriate error bars, but, regrettably, I cannot find that information.
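
As a rough illustration of items 7 and 8, the sketch below skips records whose density is the missing-value marker -9999 before computing a range. The column order (cell ID, longitude, latitude, density), the header names, and the three sample rows are assumptions made up for this example; they are not the actual contents of data.txt.

```python
import csv
import io

MISSING = -9999.0  # marker for missing density, as described in item 7

# Made-up sample rows; the real header and column order are assumed.
SAMPLE = """cell_id,longitude,latitude,density
1,-84.75,21.75,12.5
2,-84.25,21.75,-9999
3,-83.75,22.25,340.0
"""

def load_points(text):
    """Return (longitude, latitude, density) tuples, skipping missing densities."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # disregard the first line as a label
    points = []
    for row in reader:
        lon, lat, density = float(row[1]), float(row[2]), float(row[3])
        if density == MISSING:
            continue  # leave missing records out of any statistics
        points.append((lon, lat, density))
    return points

def value_range(values):
    """Return (minimum, maximum) of an iterable of numbers."""
    values = list(values)
    return min(values), max(values)

points = load_points(SAMPLE)
density_range = value_range(p[2] for p in points)
print(density_range)
```

Filtering before computing the range matters because an unfiltered -9999 would otherwise show up as the minimum density.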

### Python Program

As an extension, I wrote a (rather inelegant) Python program that does several things.

1. opens the file, disregards the first line as a label, then reads each successive line. The interesting part of this was extracting the values from each line. I was not sure whether Python had an indexOf method for strings, and decided to simply write one myself. It takes a string to look in, a character to look for, and an index to start looking at. If the character does not appear at or after that index, it returns -1, but because I knew what my file looked like, I never had to deal with this case. I used indexOf to find the commas, and then used string slicing to extract just the numeric value, which I immediately converted to a float.
2. finds the maximum and minimum values for all three variables, and calculates the mean of the population density.

3. draws a map, color coded so that higher density areas are a brighter blue than low density areas. For this I decided to just use the turtle module, which is built on Tk. I used an if/elif block to see which range of values each cell belonged in, and set its color accordingly. To position the map in the center of the window, I altered the values, using numbers specific to this data set. Thus, my program is not flexible. If I were in Java, I would have created an object for each data point I read in, and stored them in an array. Afterward, knowing my ranges, I would have calculated an appropriate offset and scale for the data set, and only then sent the data to be drawn. Instead of figuring out how to do this in Python, I drew the points as I read them in, so the program did not yet have the information needed to calculate the offset.
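
Steps 1 and 3 above could be sketched roughly as follows. This is not the original program: `index_of`, `parse_line`, and `fit_to_window` are hypothetical names, the column order (cell ID, longitude, latitude, density) is assumed, and the two input lines are made up. The idea is to read every point into a list first, and only then derive the offset and scale from the observed ranges. For what it is worth, Python's built-in `str.find` does the same job as the hand-written `index_of`, also returning -1 when nothing is found.

```python
def index_of(text, char, start=0):
    """Return the index of char in text at or after start, or -1 if absent."""
    for i in range(start, len(text)):
        if text[i] == char:
            return i
    return -1

def parse_line(line):
    """Extract (lon, lat, density) from 'id,lon,lat,density' by locating commas."""
    line = line.strip()
    first = index_of(line, ",")
    second = index_of(line, ",", first + 1)
    third = index_of(line, ",", second + 1)
    lon = float(line[first + 1:second])   # convert right away, not later
    lat = float(line[second + 1:third])
    density = float(line[third + 1:])
    return lon, lat, density

def fit_to_window(points, width, height):
    """Scale and center points around (0, 0), the middle of a turtle window."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    lon_span = (max(lons) - min(lons)) or 1.0  # avoid dividing by zero
    lat_span = (max(lats) - min(lats)) or 1.0
    scale = min(width / lon_span, height / lat_span)
    center_lon = (max(lons) + min(lons)) / 2
    center_lat = (max(lats) + min(lats)) / 2
    return [((lon - center_lon) * scale, (lat - center_lat) * scale, d)
            for lon, lat, d in points]

# Made-up input lines spanning the ranges reported for this data set.
lines = ["1,-84.75,17.75,10.0", "2,-64.75,23.75,250.5"]
points = [parse_line(line) for line in lines]
screen = fit_to_window(points, 400, 300)
print(screen)
```

Each resulting `(x, y, density)` triple could then be handed to the drawing code, so no numbers specific to one data set need to be hard-coded.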

### Reflection

Since my previous experience in Python consisted of one week this past summer (too short a time span to remember much from it), doing this project forced me to become somewhat comfortable with python.org, which will be useful this semester. I was confused for a while when my inequalities were not working: the program claimed that "17.5 >= 100." I finally realized that my 17.5 was still a string, because I was converting it to a float only after trying to find the maxima and minima. I guess I need to learn to be more careful with dynamically typed variables.
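
A small sketch of how this kind of surprise can arise, assuming both sides of the comparison were still strings (the values here are made up). Strings are compared character by character, so "17.5" sorts after "100" because '7' comes after '0'. In Python 3, comparing a string directly with a number raises a `TypeError` instead of silently giving an answer.

```python
raw = ["17.5", "100", "9.25"]  # made-up values as read from a text file

# Compared as text, the ordering is character by character.
text_comparison = "17.5" >= "100"
max_as_text = max(raw)

# Converting to float first gives the numeric ordering.
number_comparison = float("17.5") >= float("100")
max_as_number = max(float(v) for v in raw)

print(text_comparison, max_as_text)
print(number_comparison, max_as_number)
```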
