[Project 1] Day 6: Understanding Linear Regression
In statistical analysis, ‘Regression’ is a method used to understand or determine the relationship between 2 or more variables. Using this relationship, we can estimate an unknown value of one variable from the known value of the variable we use to predict it.
*A variable in statistics represents any quantity that can be measured or counted.
There are 2 main types of variables:
a) Categorical variables: variables which represent groupings.
b) Quantitative variables: variables which use numbers to represent amounts.
In the case of the CDC data we have 3 quantitative variables, namely % Diabetes, % Inactivity and % Obesity. The categorical variables are ‘STATE’ and ‘COUNTY’.
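In regression, we classify the variables we want to analyze as either the ‘dependent variable’ or an ‘independent variable’. The dependent variable is the one we want to predict, and the independent variables are the ones we use to predict it.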
As the name suggests, ‘Linear Regression’ assumes that there is a linear relationship between the variables. The end result is a straight line drawn through the data points on a plot, chosen so that it best describes the relationship.
Simple Linear Regression fits a straight line in 2 dimensions to describe the relationship between 2 variables.
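In our project we would be doing simple linear regression to show the relationships between % Diabetes & % Inactivity, % Diabetes & % Obesity, and % Inactivity & % Obesity respectively. Each of these lines would be represented by the equation y = β₀ + β₁*x₁ + ε, where y is the dependent variable, x₁ is the independent variable, β₀ is the intercept, β₁ is the slope and ε is the error term.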
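To make the equation concrete, here is a minimal sketch of how β₀ and β₁ can be estimated by ordinary least squares, along with the R^2 value of the fit. The numbers are made up for illustration and are not taken from the CDC data, the function names are my own, and treating % Inactivity as x₁ and % Diabetes as y is an assumption for this example.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariation of x and y divided by the variation of x.
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

def r_squared(x, y, beta0, beta1):
    """Share of the variance in y explained by the fitted line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (beta0 + beta1 * x)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only (NOT the CDC data).
inactivity = np.array([14.0, 16.0, 18.0, 20.0, 22.0])  # x1: % Inactivity
diabetes = np.array([7.1, 7.9, 9.2, 9.8, 11.0])        # y:  % Diabetes

beta0, beta1 = fit_simple_linear(inactivity, diabetes)
r2 = r_squared(inactivity, diabetes, beta0, beta1)
print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}, R^2 = {r2:.3f}")
```

The same two functions could be reused for the other two pairs of variables by swapping the arrays passed in.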
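Multiple Linear Regression, on the other hand, uses more than one independent variable. With two independent variables the equation becomes y = β₀ + β₁*x₁ + β₂*x₂ + ε, and the fitted model is a plane rather than a line, which we can draw in a 3 dimensional plot of % Diabetes, % Inactivity and % Obesity.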
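As a rough sketch of the plane fit, the coefficients can be estimated with NumPy's least-squares solver. The choice of % Diabetes as the dependent variable with % Inactivity and % Obesity as the two independent variables is an assumption for this example, and the data is synthetic: it is generated to lie exactly on a known plane so that it is easy to check that the fit recovers it.

```python
import numpy as np

# Synthetic values for illustration only (NOT the CDC data).
inactivity = np.array([14.0, 16.0, 18.0, 20.0, 22.0, 15.0])  # x1
obesity = np.array([17.0, 18.0, 20.0, 19.0, 21.0, 22.0])     # x2

# Generate y exactly on the plane y = 1.0 + 0.3*x1 + 0.1*x2 (no error term).
diabetes = 1.0 + 0.3 * inactivity + 0.1 * obesity

# Design matrix: a column of ones for the intercept, then x1 and x2.
X = np.column_stack([np.ones_like(inactivity), inactivity, obesity])

# Solve for (beta0, beta1, beta2) by ordinary least squares.
coeffs, _, _, _ = np.linalg.lstsq(X, diabetes, rcond=None)
beta0, beta1, beta2 = coeffs
print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}, beta2 = {beta2:.3f}")
```

With real data the points would not lie exactly on a plane, so the coefficients would describe the plane that minimizes the sum of squared errors.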
Project work: After meeting with the group today, we decided to explore our options after calculating the R^2 value for the data. We have begun the ‘feature engineering’ process. On my part, I started feature scaling and am exploring the statistical tests which will be helpful for the project.
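For reference, one common form of feature scaling is standardization (z-scores), which rescales a variable to mean 0 and standard deviation 1. The sketch below assumes that method purely for illustration. The function name and the numbers are hypothetical and are not from the CDC data.

```python
import numpy as np

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # A constant column carries no information and cannot be scaled.
        raise ValueError("cannot standardize a constant variable")
    return (values - values.mean()) / std

# Hypothetical values for illustration only (NOT the CDC data).
obesity = np.array([17.0, 18.0, 20.0, 19.0, 21.0])
obesity_scaled = standardize(obesity)
print(obesity_scaled)
```

Scaling like this puts variables on a comparable scale, which makes coefficients from a multiple regression easier to compare with each other.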