

The main purpose of this project was to integrate a simple linear regression feature into my GUI, which displays a given data set in three dimensions, lets the user interact with the displayed data, and reports some data analysis to the user.  Another goal was to implement multiple linear regression as a function in the analysis class.  In summary, there were two tasks: 1) create functions that perform simple and multiple linear regression on a data set and 2) incorporate these functions into the GUI by taking inputs from the user.  Writing the functions for simple and multiple linear regression was straightforward thanks to the scipy and numpy modules.  To receive the user's inputs, I created a dialog window that contains three list boxes: one for the dependent variable and two for the independent variables.  With these tasks complete, the user can now read in a data set, display it in three dimensions, and apply simple and multiple linear regression graphically.  I explain the process of incorporating multiple linear regression and the other extensions later.
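
As a rough illustration of the multiple-regression task described above, the sketch below fits an ordinary least-squares model with numpy.  It is not my actual linear_regression() function: the name multiple_regression and its return values are assumptions made for this example only.

```python
import numpy as np


def multiple_regression(X, y):
    """Fit y = b0 + b1*x1 + ... + bk*xk by ordinary least squares.

    X is an (n, k) array of independent variables and y an (n,) array.
    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    NOTE: hypothetical helper for illustration, not the project's function.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the intercept is estimated with the slopes.
    A = np.column_stack([np.ones(len(y)), X])
    coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    residuals = y - A @ coeffs
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r_squared = 1.0 - ss_res / ss_tot
    return coeffs, r_squared
```

Calling multiple_regression(X, y) with two columns in X corresponds to the case of one dependent and two independent variables chosen in the dialog window.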

Instructions for Linear Regression:

Follow these instructions to run a simple linear regression on the dataset 'data-simple.csv'.

  1. Before running linear regression, the user first must open a file and read it by pressing 'Control-o'.  Once the data set is loaded into the second listbox, the user should then select the dataset and click the button 'Plot Data'.
  2. Once the data is plotted, the user should press 'Control-l' to run linear regression.  If done correctly, the user will see a dialog window (Figure 1).  For simple linear regression, the user should not make a selection in the listbox labeled 'Independent Variable1'.
     (Figure 1)

    If the data is not loaded and plotted, the error message below (Figure 2) will pop up.
     (Figure 2)

    If the user clicks the button 'cancel' or does not make any selections from the listboxes, different error messages will show.

  3. If the user plots X0 on the X-axis and Y on the Y-axis, then selects Y as the dependent variable and X0 as the independent variable, the image below (Figure 3) will be plotted.
     (Figure 3)
    Note that I reported the slope, intercept, and r-value in the right control frame.
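
The three values reported in step 3 can be computed by hand as well.  The sketch below is a dependency-free version of what a routine such as scipy.stats.linregress returns for slope, intercept, and r-value; the function name simple_regression is an assumption for this example and is not the name used in my code.

```python
import math


def simple_regression(xs, ys):
    """Return (slope, intercept, r_value) for the line y = slope*x + intercept.

    NOTE: hypothetical helper for illustration only.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_value = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r_value
```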

Required Images:

In order to test whether my linear_regression() function works properly, I created a test function called 'test_multiple_regression', which takes a filename as a parameter.  I call this function in the main method of the analysis class, so to run the test, simply run the file containing the analysis class with the filename as a command-line argument.  I tested with the following data sets: data-clean.csv, data-good.csv, and data-noisy.csv.  Note that my test function works only for these data sets.

The following images show the results of running my linear_regression function on data-clean.csv, data-good.csv, and data-noisy.csv, respectively.

After checking the values printed from my function with those on the webpage, I confirmed that my linear_regression() function works properly.

To test my linear regression functions even further, I used my own dataset called 'mydata.csv', which contains data related to the economy of the U.S. from 1970 to 2015, and performed simple linear regression on the following variables: 'Government Spending' as the independent variable and 'Real GDP' as the dependent variable.  See Figure 4 for the result.

 (Figure 4)

Despite some outliers, the data shows a weak positive linear relationship (the r-value is 0.585).  This relationship makes sense because government spending is a component of output in both open and closed economies.  In economics, this can be explained using the following equation: Y (output, i.e. GDP) = C (consumption) + I (business investment) + G (government spending) + NX (net exports).  All else fixed, an increase in government spending should lead to an increase in output.  Thus, government spending and real GDP should have some positive linear relationship.

Because I mentioned business investment while discussing the equation above, I decided to perform multiple linear regression using 'Government Spending' as the first independent variable, 'Corporate Investment' as the second independent variable, and 'Real GDP' as the dependent variable.  Figure 5 shows the data plotted in my GUI.

 (Figure 5)

The numerical results can be found in the right control frame.  Given that the r-value decreased from that of the previous analysis, government spending and corporate investment do not seem sufficient to explain the variance in real GDP.  Intuitively, this makes sense because other factors, such as consumption and net exports, affect real GDP.


  • Incorporate multiple linear regression into your GUI. Start by just displaying the coefficients of the fit, then extend the GUI to display the regression line in 3D. For fits higher than 3D, you have to be careful when calculating the endpoints of the best fit line in the view space.

    : First, I needed to make my linear_regression() function in the analysis class report the minimum and maximum values of the two independent variables and the dependent variable along with the other coefficients.  I then made sure the dialog window for linear regression has two list boxes for the independent variables.  Once I ensured that the user can provide inputs for multiple linear regression, I modified the buildLinearRegression() function in the display class to call simple or multiple regression based on the user's inputs in the dialog window.  If the user provides an input for the second independent variable, then buildLinearRegression() calculates and creates four endpoints based on the minimum and maximum values of the variables.  It then creates four tk Line objects to represent a plane.  The trick was to figure out which values to use for the endpoints and which endpoints to use for each line object.
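
The endpoint calculation described above can be sketched as follows.  This is a simplified illustration, not my buildLinearRegression() code: the function name plane_edges, the coefficient names b0, b1, b2, and the (x0, x1, y_hat) point layout are assumptions, and the normalization of the points into view space is left out.

```python
def plane_edges(b0, b1, b2, x0_range, x1_range):
    """Return four (start, end) segments outlining the fitted plane.

    Each point is (x0, x1, y_hat) with y_hat = b0 + b1*x0 + b2*x1.
    NOTE: hypothetical helper; view-space normalization is omitted.
    """
    x0_min, x0_max = x0_range
    x1_min, x1_max = x1_range
    # Walk the rectangle's corners in order so consecutive corners share an edge.
    corners_2d = [
        (x0_min, x1_min),
        (x0_max, x1_min),
        (x0_max, x1_max),
        (x0_min, x1_max),
    ]
    corners = [(x0, x1, b0 + b1 * x0 + b2 * x1) for x0, x1 in corners_2d]
    # Pair each corner with the next one, wrapping around to close the outline.
    return [(corners[i], corners[(i + 1) % 4]) for i in range(4)]
```

Ordering the corners around the rectangle is one way to settle which endpoints belong to which line: each of the four line objects simply joins one corner to the next.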

  • Further extend your GUI in any of the directions suggested last week. Add legends, axis labels (e.g. headers and values) or other features to the GUI for plotting data.

    : I further extended my GUI in three directions: a legend, axis labels, and more labels that report information about the data set.  First, I created a color legend that shows a spectrum of color from the minimum to the maximum value of the data chosen for color.  The top of the color legend represents the minimum; the bottom represents the maximum.  To create the legend, I needed to create a tk Canvas object in the right control frame, and then draw a series of lines vertically in that canvas.  To show the spectrum of color, I used the loop variable when determining the color of each line.  Second, I enabled my axis labels, X, Y, and Z, to update based on the user's selections for the corresponding axes.  Lastly, I displayed labels for the independent and dependent variables along with other coefficients related to linear regression in the right control frame.  See Figure 6 for an illustration.
     (Figure 6)
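
The color legend described above can be sketched as a loop that draws one line per pixel row.  The blue-to-red gradient and the function names legend_colors and draw_legend are assumptions for this example; my GUI may use a different color mapping.  The canvas argument is expected to behave like a tk Canvas, i.e. to provide create_line().

```python
def legend_colors(height):
    """Return one '#rrggbb' color per pixel row, top (minimum) to bottom (maximum).

    NOTE: the blue-to-red gradient is an assumed mapping for illustration.
    """
    colors = []
    for row in range(height):
        # t runs from 0.0 at the top row to 1.0 at the bottom row.
        t = row / (height - 1) if height > 1 else 0.0
        red = int(round(255 * t))
        blue = 255 - red
        colors.append("#%02x%02x%02x" % (red, 0, blue))
    return colors


def draw_legend(canvas, width, height):
    """Draw one horizontal line per row on a tk-Canvas-like object."""
    for row, color in enumerate(legend_colors(height)):
        canvas.create_line(0, row, width, row, fill=color)
```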

  • Do some more exploration with different data sets using your new tool.

    : I decided to perform more simple linear regressions on mydata.csv.  First, I examined the relationship between inflation (independent) and the interest rate (dependent).  Figure 7 shows a relatively strong, positive relationship between the variables.  A possible explanation is that at times of high inflation, the central bank (the Federal Reserve) raised the interest rate to fight inflation.
     (Figure 7)
    I also looked at the relationship between real GDP (independent) and the unemployment rate (dependent).  Figure 8 illustrates no linearity between the two variables.  This result is not so surprising, as unemployment depends on more than just output.  It also suggests that a high unemployment rate is a difficult issue to deal with because a government cannot solve the problem simply by increasing output.
     (Figure 8)
    Lastly, I examined the relationship between government spending (independent) and corporate investment (dependent) to see if the 'crowding-out' effect actually holds.  The crowding-out effect refers to a decrease in business investment induced by an increase in government spending, which causes the interest rate to increase.  Figure 9 shows a weak, positive relationship between the two variables, although the r-value is low.  This result is the opposite of the crowding-out effect.  I am not sure which other factors led to this result.
     (Figure 9)

  • Give the user the ability to save the linear regression analysis to a file in a human-readable format. Extend it even further to allow the user to read an analysis back in and replot it over the correct data.

    : In order to implement this extension, I created two new functions in the display class: storeAnalysis() and rePlot(), which can be activated by the buttons 'Store Analysis' and 'Replot', respectively.  When storing the analysis, I wrote the filename, the headers of the independent and dependent variables, the headers of the data points, and the coefficients from the linear regression to a csv file.  I added a time-stamp to the original filename for the user's convenience.  I also added the string 'single' or 'multiple' to the beginning of the new csv filename so that my rePlot() function can differentiate between simple and multiple linear regression.  My rePlot() function then reads in the necessary information from the csv file created by storeAnalysis() and calls the buildPoints() and buildLinearRegression() functions again.
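
The store-and-reload idea above can be sketched with the standard csv module.  This is a simplified illustration rather than my storeAnalysis() and rePlot() code: the function names, the row labels, and the exact filename pattern are assumptions, and the headers of the plotted data points are left out.

```python
import csv
import os
import time


def store_analysis(out_dir, data_filename, independent, dependent, coefficients):
    """Write one regression analysis to a csv file and return its path.

    NOTE: hypothetical helper; the file layout is assumed for illustration.
    """
    # Prefix tells the reader which kind of regression was stored.
    kind = "single" if len(independent) == 1 else "multiple"
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = os.path.splitext(os.path.basename(data_filename))[0]
    path = os.path.join(out_dir, "%s-%s-%s.csv" % (kind, base, stamp))
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["data file", data_filename])
        writer.writerow(["independent"] + list(independent))
        writer.writerow(["dependent", dependent])
        writer.writerow(["coefficients"] + [repr(float(c)) for c in coefficients])
    return path


def load_analysis(path):
    """Read back a file written by store_analysis as a dictionary."""
    with open(path, newline="") as f:
        rows = {row[0]: row[1:] for row in csv.reader(f) if row}
    return {
        "kind": os.path.basename(path).split("-", 1)[0],
        "data_file": rows["data file"][0],
        "independent": rows["independent"],
        "dependent": rows["dependent"][0],
        "coefficients": [float(c) for c in rows["coefficients"]],
    }
```

The dictionary returned by load_analysis() holds what a replot step needs: which data file to reopen, which headers to plot, and the coefficients for redrawing the fit.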

  • Figure out how to save a picture of a plot to a file. 

    : I created a new button 'Store Image', which calls the storeImage() function in the display class.  The function creates a PostScript file given the filename and data of the loaded data set.  If the user has not read in and plotted the data, the button will give an error message.  I will demonstrate this extension and the one above in the demo.
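
A minimal sketch of the image-saving step, assuming the plot lives on a tk Canvas: the Canvas.postscript() method writes the drawing to a file.  The function name store_image, the data_loaded flag, and raising an exception instead of showing a dialog are assumptions for this example and differ from my storeImage() function.

```python
def store_image(canvas, filename, data_loaded):
    """Save the canvas contents as PostScript, or raise if nothing is plotted.

    NOTE: hypothetical helper; the real GUI shows an error dialog instead.
    """
    if not data_loaded:
        raise RuntimeError("read in and plot a data set before storing an image")
    if not filename.endswith(".ps"):
        filename += ".ps"
    # tkinter's Canvas.postscript writes the drawing to the named file.
    canvas.postscript(file=filename, colormode="color")
    return filename
```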


Acknowledgements: Demo

For this project, I received help from professor Taylor and Maxwell.