In my previous blog, I mentioned the term ‘heteroscedasticity’. However, I didn't explain its relevance to the project, so I decided to revisit the concept.
- While plotting the linear regression for diabetes and inactivity, it was observed that the percentage of diabetics due to inactivity was 19.51%.
On checking the residuals afterwards, the following plot was observed (noted in the last blog).
While there is a significant amount of data near the 0-line, ranging between -0.5 and 0.5, there is a larger amount of data which is further from the line.
- Heteroscedasticity is the ‘fanning out’ of the residuals, i.e. their distance from the 0-line increases along the fitted line. Heteroscedasticity makes a regression fitted to the data less reliable. The common methods to identify heteroscedasticity are visual plots and the Breusch-Pagan test.
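To make the Breusch-Pagan idea concrete, here is a minimal sketch in plain Python. It is not the code used in this project: the data is synthetic, the names (`fit_line`, `breusch_pagan`, `y_fan`, `y_even`) are made up for illustration, and it uses the studentized (Koenker) form of the test for a single predictor, where the statistic is n times the R^2 from regressing the squared residuals on the predictor. A small p-value suggests the residual spread changes with the predictor.

```python
import math
import random


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(x, y):
    """R^2 of the least-squares line of y on x."""
    a, b = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def breusch_pagan(x, y):
    """Studentized (Koenker) Breusch-Pagan test for one predictor.

    Returns (LM statistic, p-value). With one predictor the statistic is
    compared against a chi-square distribution with 1 degree of freedom,
    whose upper tail equals erfc(sqrt(LM / 2)).
    """
    a, b = fit_line(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    lm = max(0.0, len(x) * r_squared(x, sq_resid))
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value


# Synthetic data (an assumption for illustration, not the project data).
random.seed(0)
x = [random.uniform(10, 30) for _ in range(300)]
# Noise that grows with x: residuals fan out.
y_fan = [5 + 0.2 * xi + random.gauss(0, 0.05 * xi) for xi in x]
# Noise with constant spread: no fanning out.
y_even = [5 + 0.2 * xi + random.gauss(0, 1.0) for xi in x]

lm_fan, p_fan = breusch_pagan(x, y_fan)
lm_even, p_even = breusch_pagan(x, y_even)
print(f"fanning noise:  LM = {lm_fan:.2f}, p = {p_fan:.4g}")
print(f"constant noise: LM = {lm_even:.2f}, p = {p_even:.4g}")
```

On real data the same function would take the inactivity column as `x` and the diabetes column as `y`.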
- When comparing the heteroscedasticity in the residual plot for the combined obesity-inactivity-diabetes dataframe with that of the individual inactivity-diabetes dataframe, the scatter of the outliers in the combined dataframe is smaller than in the individual dataframe.
(Figure: residual plot for the combined data)
This shows that the combined dataframe is more reliable than the individual data.
- So, coming back to the square of Pearson’s correlation (R^2), it should confirm that the combined dataframe is more reliable than the limited dataframe.
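The comparison between the combined and the individual fit can be sketched roughly as follows. This is an illustration under assumptions, not the project code: the data is synthetic, and the variable and function names (`inactivity`, `obesity`, `diabetes`, `fit_one`, `fit_two`, `summarise`) are hypothetical. It fits diabetes on inactivity alone, then on inactivity and obesity together, and reports the residual range and R^2 for each.

```python
import random


def mean(v):
    return sum(v) / len(v)


def fit_one(x1, y):
    """Least-squares fit of y = a + b1*x1; returns (a, b1)."""
    m1, my = mean(x1), mean(y)
    s11 = sum((u - m1) ** 2 for u in x1)
    s1y = sum((u - m1) * (t - my) for u, t in zip(x1, y))
    b1 = s1y / s11
    return my - b1 * m1, b1


def fit_two(x1, x2, y):
    """Least-squares fit of y = a + b1*x1 + b2*x2.

    Solves the 2x2 normal equations on centred data; returns (a, b1, b2).
    """
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (t - my) for u, t in zip(x1, y))
    s2y = sum((v - m2) * (t - my) for v, t in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return my - b1 * m1 - b2 * m2, b1, b2


def summarise(y, fitted):
    """Return (residual range, R^2) for a set of fitted values."""
    resid = [t - f for t, f in zip(y, fitted)]
    my = mean(y)
    ss_res = sum(r ** 2 for r in resid)
    ss_tot = sum((t - my) ** 2 for t in y)
    return max(resid) - min(resid), 1 - ss_res / ss_tot


# Synthetic percentages (an assumption for illustration only).
random.seed(1)
inactivity = [random.uniform(10, 25) for _ in range(200)]
obesity = [random.uniform(15, 30) for _ in range(200)]
diabetes = [1 + 0.2 * i + 0.15 * o + random.gauss(0, 0.5)
            for i, o in zip(inactivity, obesity)]

a, b1 = fit_one(inactivity, diabetes)
single_fit = [a + b1 * i for i in inactivity]
range_single, r2_single = summarise(diabetes, single_fit)

a, b1, b2 = fit_two(inactivity, obesity, diabetes)
combined_fit = [a + b1 * i + b2 * o for i, o in zip(inactivity, obesity)]
range_combined, r2_combined = summarise(diabetes, combined_fit)

print(f"inactivity only:      residual range = {range_single:.2f}, R^2 = {r2_single:.3f}")
print(f"inactivity + obesity: residual range = {range_combined:.2f}, R^2 = {r2_combined:.3f}")
```

One thing to keep in mind when reading the output: R^2 measured on the same data used for fitting can only stay the same or rise when a predictor is added, so a higher value for the combined fit is expected by construction.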
- To get started with calculating R^2, I have calculated the correlation between the variables. I will go further into this in the next blog.
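As a small preview of how the correlation leads to R^2, here is a sketch in plain Python. The numbers are made-up placeholder values, not the project data, and the function names are hypothetical. For a straight-line fit with one predictor, squaring Pearson's correlation r gives the same value as R^2 computed from the residuals, which the code checks.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared_from_fit(x, y):
    """R^2 = 1 - SS_res / SS_tot for the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up illustrative percentages (NOT the project data).
inactivity = [12.0, 14.5, 15.1, 16.8, 18.2, 19.9, 21.3, 23.0]
diabetes = [6.9, 7.4, 7.1, 8.3, 8.0, 9.1, 8.8, 9.9]

r = pearson_r(inactivity, diabetes)
r2 = r_squared_from_fit(inactivity, diabetes)
print(f"r = {r:.4f}, r squared = {r ** 2:.4f}, R^2 from fit = {r2:.4f}")
```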