The main purpose of this project was to integrate a simple linear regression feature into my GUI, which can display a given data set in three dimensions, let the user interact with the displayed data, and report some data analysis to the user.  Another goal was to implement multiple linear regression as a function in the analysis class.  In summary, there were two tasks: 1) create functions that perform simple and multiple linear regression on a given data set, and 2) incorporate these functions into the GUI by taking inputs from the user.  Writing the functions for simple and multiple linear regression was straightforward thanks to the scipy and numpy modules.  To receive the user's inputs, I created a dialog window that contains three listboxes: one for the dependent variable and two for the independent variables.  With these tasks complete, the user can now read in a data set, display it in three dimensions, and apply simple and multiple linear regression graphically.  I will explain the process of incorporating multiple linear regression and other extensions later.
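To make the two regression tasks concrete, here is a minimal sketch of how such functions could look with scipy and numpy.  It is not the code from my analysis class: the function names simple_regression and multiple_regression and their return values are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats


def simple_regression(x, y):
    """Fit y = slope * x + intercept; return (slope, intercept, r_value)."""
    result = stats.linregress(x, y)
    return result.slope, result.intercept, result.rvalue


def multiple_regression(X, y):
    """Fit y = b0*x0 + b1*x1 + ... + c by ordinary least squares.

    X has shape (n_samples, n_features).  Returns (coeffs, r_squared),
    where the last entry of coeffs is the intercept c.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Append a column of ones so the last coefficient is the intercept.
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    residuals = y - A @ coeffs
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return coeffs, r_squared
```

With one independent variable the two functions describe the same line; the multiple-regression version simply accepts additional columns in X.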

Instructions for Linear Regression:

Follow these instructions to run a simple linear regression on the dataset 'data-simple.csv'.

  1. Before running linear regression, the user first must open a file and read it by pressing 'Control-o'.  Once the data set is loaded into the second listbox, the user should then select the dataset and click the button 'Plot Data'.
  2. Once the data is plotted, the user should press 'Control-l' to run linear regression.  If done correctly, the user will see a dialog window (Figure 1).  For simple linear regression, the user should not make a selection in the listbox labeled 'Independent Variable1'.
     (Figure 1)

    If the data is not loaded and plotted, the error message below (Figure 2) will pop up.
     (Figure 2)

    If the user clicks the button 'cancel' or does not make any selections from the listboxes, different error messages will show.

  3. If the user plots X0 on the X-axis and Y on the Y-axis, and selects Y as the dependent variable and X0 as the independent variable, the image below (Figure 3) will be plotted.
     (Figure 3)
    Note that I reported the slope, intercept, and r-value in the right control frame.

Required Images:

To test whether my linear_regression() function works properly, I created a test function called 'test_multiple_regression', which takes a filename as a parameter.  I call this function in the main method of the analysis class, so the test can be run by executing the analysis class with a filename as a command line argument.  I tested with the following data sets: data-clean.csv, data-good.csv, and data-noisy.csv.  Note that my test function works only for these data sets.

The following images show the results of running my linear_regression function on data-clean.csv, data-good.csv, and data-noisy.csv, respectively.

After checking the values printed from my function with those on the webpage, I confirmed that my linear_regression() function works properly.
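A complementary way to check a regression function, besides comparing against published values, is to generate data from known coefficients and confirm that the fit recovers them.  The sketch below illustrates the idea with numpy only; the helper names fit_plane and check_recovers_known_coefficients and the made-up coefficients are assumptions for illustration, not part of my test function.

```python
import numpy as np


def fit_plane(x0, x1, y):
    """Least-squares fit of y = b0*x0 + b1*x1 + c; returns [b0, b1, c]."""
    A = np.column_stack([x0, x1, np.ones(len(y))])
    coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs


def check_recovers_known_coefficients(noise_scale=0.0, seed=0):
    """Return the largest absolute error between true and fitted coefficients."""
    rng = np.random.default_rng(seed)
    true_coeffs = np.array([2.0, -1.5, 4.0])  # b0, b1, intercept (made up)
    x0 = rng.uniform(0.0, 10.0, size=200)
    x1 = rng.uniform(0.0, 10.0, size=200)
    y = true_coeffs[0] * x0 + true_coeffs[1] * x1 + true_coeffs[2]
    # Add Gaussian noise; noise_scale = 0 gives perfectly clean data.
    y = y + noise_scale * rng.standard_normal(y.shape)
    fitted = fit_plane(x0, x1, y)
    return float(np.max(np.abs(fitted - true_coeffs)))
```

With noise_scale set to 0 the error should be at floating-point level, and it should grow gradually as noise is added, roughly mirroring the progression from clean to noisy data sets.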

To test my linear regression functions even further, I used my own dataset called 'mydata.csv', which contains data related to the economy of the U.S. from 1970 to 2015, and performed simple linear regression on the following variables: 'Government Spending' as the independent variable and 'Real GDP' as the dependent variable.  See Figure 4 for the result.

 (Figure 4)

Despite some outliers, the data shows a weak positive linear relationship (R-value is 0.585).  This positive relationship makes sense because government spending is a component of output in both open and closed economies.  In economics, this can be explained using the following equation: Y (output, i.e. GDP) = C (consumption) + I (business investment) + G (government spending) + NX (net exports).  All else fixed, an increase in government spending should lead to an increase in output.  Thus, government spending and real GDP should have some positive linear relationship.
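As a quick worked example of what this R-value implies, and assuming it is the Pearson correlation coefficient from the simple regression, squaring it gives the fraction of the variance in real GDP that the fitted line accounts for:

```python
r_value = 0.585  # R-value reported above for Government Spending vs. Real GDP
r_squared = r_value ** 2  # coefficient of determination for a simple regression
print(f"r^2 = {r_squared:.3f}")  # prints "r^2 = 0.342"
```

Under that assumption, government spending alone accounts for roughly a third of the variance in real GDP, which is consistent with describing the relationship as weak.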

Because I mentioned business investment while discussing the equation above, I decided to perform multiple linear regression using 'Government Spending' as the first independent variable, 'Corporate Investment' as the second independent variable, and 'Real GDP' as the dependent variable.  Figure 5 shows the data plotted in my GUI.

 (Figure 5)

The numerical results can be found in the right control frame.  Given that the r-value decreased from that of the previous analysis, government spending and corporate investment do not seem sufficient to explain the variance in real GDP.  Intuitively, this makes sense because other factors, such as consumption and net exports, also affect real GDP.



Acknowledgements: Demo

For this project, I received help from Professor Taylor and Maxwell.