[Project1] Day 3: Continuing Data Exploration : Understanding Concepts – Advance Mathematical Statistics (MTH 522) Assignments

In my previous blog, I mentioned the term ‘heteroscedasticity’. However, I didn't explain its relevance to the project, so I decided to revisit the concept.

While plotting the linear regression for diabetes and inactivity, it was observed that the percentage of diabetics due to inactivity was 19.51%. On checking the residuals afterwards, the following plot was observed (shown in the last blog).

While a significant amount of data lies near the 0-line, between -0.5 and 0.5, a larger amount of data lies further from the line.
Heteroscedasticity is this ‘fanning out’, i.e. the spread of the residuals increases along the line instead of staying constant. Heteroscedasticity makes conclusions drawn from the regression less reliable, because ordinary least squares assumes that the residuals have constant variance. The common methods to identify heteroscedasticity are visual inspection of the residual plot and the Breusch-Pagan test.
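To make the Breusch-Pagan idea concrete, here is a minimal sketch in plain Python. It assumes a single predictor and uses the studentized (Koenker) form of the test: regress the squared residuals on the predictor and take n times the R^2 of that auxiliary regression as the test statistic. The data is synthetic and generated only for illustration, and the function names (`fit_line`, `r_squared`, `breusch_pagan`) are my own, not from any library or from the project.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def r_squared(x, y):
    """R^2 of the simple linear regression of y on x."""
    a, b = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def breusch_pagan(x, y):
    """Studentized Breusch-Pagan test for one predictor.

    Returns (LM statistic, p-value). Under the null hypothesis of constant
    variance, LM is approximately chi-square with 1 degree of freedom.
    """
    a, b = fit_line(x, y)
    squared_residuals = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    lm = len(x) * r_squared(x, squared_residuals)
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value


# Synthetic data (an assumption for illustration, not the project's data):
# the noise standard deviation grows with x, so the residuals fan out.
random.seed(0)
x = [random.uniform(1, 10) for _ in range(200)]
y_het = [2 + 0.5 * xi + random.gauss(0, 0.3 * xi) for xi in x]

lm_het, p_het = breusch_pagan(x, y_het)
print(f"LM statistic: {lm_het:.2f}, p-value: {p_het:.4g}")
```

A small p-value (for example, below 0.05) is evidence against constant variance, i.e. evidence of heteroscedasticity. With more than one predictor, the auxiliary regression would include all predictors and the degrees of freedom would change accordingly.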
When comparing the heteroscedasticity in the residual plot for the combined obesity-inactivity-diabetes dataframe with that of the individual inactivity-diabetes dataframe, the residuals in the combined dataframe are scattered over a smaller range than those in the individual dataframe. (Figure: combined data residual plot)
This suggests that the combined dataframe is more reliable than the individual data.
So, coming back to R^2 (the square of Pearson's correlation coefficient), it should confirm that the combined dataframe is more reliable than the limited dataframe.
To get started with calculating R^2, I have calculated the correlation between the variables. I will go further into this in the next blog.
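As a rough sketch of how the correlation leads to R^2: for a simple linear regression with one predictor, R^2 equals the square of Pearson's correlation coefficient r. The numbers below are made up purely for illustration and are not the project's data; `pearson_r` is a helper name I chose for this example.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical percentages, for illustration only.
inactivity = [14.0, 15.5, 16.2, 17.8, 18.1, 19.4, 20.0]
diabetes = [6.5, 7.9, 7.1, 8.8, 8.0, 9.6, 9.1]

r = pearson_r(inactivity, diabetes)
r2 = r ** 2  # R^2 of the one-predictor linear regression
print(f"r = {r:.3f}, R^2 = {r2:.3f}")
```

With more than one predictor (such as the combined obesity-inactivity-diabetes data), R^2 is no longer the square of a single pairwise correlation, so it would need to come from the fitted regression itself.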
