ErenMcLaren / Automatic_Regression_Line_Plot

Calculate functional form of linear regression of Y on X and produce an associated plot with hypothesis testing values for the slope and y-intercept.

Automatic_Regression_Line_Plot

Purpose: Automate the construction of a "nice-looking" linear regression plot with a table of confidence intervals for the slope and y-intercept of the regression line alongside. Also, I got sick of undergraduate CS students telling me I'll never understand ML because I'm too dumb to even do regression properly.

About:

Regression analysis is a method designed to ascertain a functional relationship among variables [1]. Simple linear regression is the particular case with two variables, a predictor variable X and a response variable Y, where we assume that the regression of Y on X is linear, meaning that E(Y|X=x) = mx + b for some slope m and y-intercept b [1]. In the standard notation of statistics, the y-intercept is denoted \beta_{0} and the slope is denoted \beta_{1}, so that E(Y|X=x) = \beta_{0} + \beta_{1}x [1]. To obtain a functional relationship between X and Y is to analyze the raw data and find a "line of best fit" that closely matches the trend in the data. Because we are assuming the regression of Y on X is linear but do not know the population values \beta_{0} and \beta_{1} characterizing the linear regression, we construct a sample regression line of the form \hat{y} = b_{0} + b_{1}x [1]. A very popular method of choosing b_{0} and b_{1} is the method of least squares. The method of least squares seeks to minimize the sum of the squares of the residuals, each residual being the difference between the actual value of y (the data point) and the predicted value \hat{y} [1]. Solving this optimization problem leads to "normal equations" that can be solved to obtain formulas for both the sample slope and y-intercept of the regression line [1].
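For anyone who wants to see what solving the normal equations boils down to, here is a minimal sketch in plain Python. It is not the code from this repository; the function name `least_squares_fit` and the sample data are made up for illustration. It uses the textbook closed-form solution b_{1} = SXY / SXX and b_{0} = \bar{y} - b_{1}\bar{x}.

```python
def least_squares_fit(xs, ys):
    """Return (b0, b1) for the sample regression line y = b0 + b1*x.

    Hypothetical helper for illustration; not taken from the repository.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # SXX: sum of squared deviations of x; SXY: sum of cross deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b1 = sxy / sxx            # sample slope
    b0 = y_bar - b1 * x_bar   # sample y-intercept
    return b0, b1


# Made-up data lying exactly on y = 1 + 2x.
b0, b1 = least_squares_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(f"y = {b0:.3f} + {b1:.3f}x")
```

Since the made-up points lie exactly on a line, the fit recovers that line; with noisy data the same two formulas give the line that minimizes the sum of squared residuals.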

This program employs the method of least squares to obtain a functional form of the regression line. Further, it tests the hypotheses that the slope and y-intercept of the regression line are 0 using 80%, 90%, 95%, 98%, 99%, and 99.9% confidence intervals: if an interval does not contain 0, the corresponding hypothesis is rejected at that confidence level. The figure contains both the plot of the raw data with the overlaid regression line (and its equation) and a table displaying upper and lower bounds for the regression line slope and y-intercept. At the present moment, the figure looks what one might call "pretty good," but it can definitely be improved. It's somewhat of a pain wrestling with Matplotlib. Any and all suggestions are welcome.
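A rough sketch of how one such confidence interval can be computed, using the usual standard-error formulas for simple linear regression. This is an illustration, not the repository's code: the function name `slope_intercept_intervals` and the data are hypothetical, and the t critical value is passed in by the caller (looked up from a table for n - 2 degrees of freedom) so that the sketch needs nothing beyond the standard library.

```python
import math


def slope_intercept_intervals(xs, ys, t_crit):
    """Return confidence intervals for the intercept and slope.

    t_crit is the two-sided t critical value for n - 2 degrees of freedom
    at the desired confidence level, supplied by the caller (an assumption
    of this sketch; the repository may obtain it differently).
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three (x, y) pairs of equal length")

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residual sum of squares and the estimate of the error std. deviation.
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(rss / (n - 2))

    # Standard errors of the slope and intercept estimates.
    se_b1 = s / math.sqrt(sxx)
    se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / sxx)

    return {
        "intercept": (b0 - t_crit * se_b0, b0 + t_crit * se_b0),
        "slope": (b1 - t_crit * se_b1, b1 + t_crit * se_b1),
    }


# Made-up data; 3.182 is the 95% two-sided t value for 3 degrees of freedom.
intervals = slope_intercept_intervals(
    [1, 2, 3, 4, 5], [2.2, 4.1, 6.2, 7.9, 10.1], t_crit=3.182
)
for name, (lower, upper) in intervals.items():
    verdict = "fail to reject" if lower <= 0 <= upper else "reject"
    print(f"{name}: ({lower:.3f}, {upper:.3f}) -> {verdict} H0: {name} = 0")
```

For this made-up data set, the slope interval excludes 0 while the intercept interval contains it, so at the 95% level only the hypothesis that the slope is 0 gets rejected.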

Before performing any calculations, you must answer a few prompts: (1) whether you want to see where in the program you are as it runs (verbose), (2) whether you want to see a ridiculous amount of output (debugging), (3) your x data values, and (4) your y data values (the values can be comma or space delimited, but nothing crazy). Yeah, yeah, inefficient, terrible practice, yadda, yadda. Make sure you put your x data values and/or y data values into a list or something first. I dunno. Nobody's reading this anyway.

Would you like to see verbose output? > 
Would you like to see debugging output? (Hint: NO, you DON'T) >
Please enter your x-values for the raw data > 
Please enter your y-values for the raw data >
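If you are curious how comma- or space-delimited input like this can be turned into numbers, here is one possible sketch. The function name `parse_values` is hypothetical and the program's actual parsing may differ; the regular expression simply treats any run of commas and whitespace as a separator.

```python
import re


def parse_values(text):
    """Split a comma- and/or space-delimited string into a list of floats.

    Hypothetical helper for illustration; not taken from the repository.
    """
    # Split on any run of commas and whitespace, dropping empty tokens.
    tokens = [tok for tok in re.split(r"[,\s]+", text.strip()) if tok]
    if not tokens:
        raise ValueError("no values entered")
    # float() raises ValueError on anything that is not a number.
    return [float(tok) for tok in tokens]


# In an interactive session the strings would come from input(), e.g.
#   xs = parse_values(input("Please enter your x-values for the raw data > "))
xs = parse_values("1, 2,3   4.5")
ys = parse_values("2.0 4.1, 5.9,9")
if len(xs) != len(ys):
    raise ValueError("x and y must contain the same number of values")
print(xs, ys)
```

Checking that both lists have the same length up front gives a clearer error than letting the regression arithmetic fail later.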

Feel free to make any modifications or use and abuse this code to your liking. Feel free to copy/paste the entire thing and use it for your own good and/or pretend you wrote it. Seriously. I have no problems with it.

References:

  • [1] Sheather, Simon J. A Modern Approach to Regression with R. Springer, 2009.
  • Parenthetical:

    For math.sqrt(x) being faster than x**0.5 -> https://stackoverflow.com/a/327011
    For finding the "longest" list in another list -> https://stackoverflow.com/a/42827330
    For how to access the error as "e" -> https://stackoverflow.com/a/4690655
    For rounding a number to n -> https://stackoverflow.com/a/3411435
    For typesetting LaTeX in Matplotlib -> https://stackoverflow.com/q/30281485
    For dealing with annoying LaTeX string parsing with Matplotlib -> https://stackoverflow.com/a/2476868
    For future reference, consider using Pandas DF for simple Matplotlib tables -> https://stackoverflow.com/a/46664216
    For why Matplotlib tables' fontsize is broken & need a workaround -> https://stackoverflow.com/a/15514091
    For why and how on earth I need to color every table cell and column -> https://stackoverflow.com/a/46664216
    For horizontally aligning text within Matplotlib table -> https://stackoverflow.com/a/38173860
    For getting things in a dictionary in a for loop -> https://stackoverflow.com/a/17793421
    For "generating" a dictionary in a for loop -> https://stackoverflow.com/a/30280874


    Languages

    Language:Python 100.0%