[Project 1] Day 2: Continuing Data Exploration – Advance Mathematical Statistics (MTH 522) Assignments

Today, along with my group mates, I decided to put the obesity-diabetes work on hold and start with the inactivity-diabetes data.

The first thing that concerned me was whether to use the inactivity data from the spreadsheet that combines it with the obesity data, or to use the inactivity spreadsheet on its own. The reason is that combining all 3 spreadsheets brings the number of data points down to 354, whereas the inactivity spreadsheet alone has 1370 data points.
So I decided to compare the descriptive statistics of the merged dataframe and the individual dataframe. The comparison of the two sets of visualizations is as follows:
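A drop in row count like this is what happens when only the counties present in every spreadsheet are kept (an inner join). Below is a minimal sketch of that effect in plain Python. The county codes and percentages are made-up toy values, and joining on a county code is my assumption about how the spreadsheets line up, not something taken from the actual files.

```python
from statistics import mean, median, stdev

# Hypothetical toy data: county code -> percentage. NOT the project's real values.
inactivity = {"01001": 17.0, "01003": 16.5, "01005": 18.2, "01007": 15.9, "01009": 16.1}
obesity = {"01001": 18.7, "01005": 19.4, "01009": 18.1}
diabetes = {"01001": 9.2, "01003": 8.8, "01005": 10.1, "01009": 8.5}

# Inner join: keep only the counties that appear in all three tables.
common = sorted(set(inactivity) & set(obesity) & set(diabetes))

full_inactivity = list(inactivity.values())
merged_inactivity = [inactivity[code] for code in common]


def describe(values):
    """Return a few descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "std": stdev(values),  # sample standard deviation
    }


full_stats = describe(full_inactivity)
merged_stats = describe(merged_inactivity)

print("inactivity only:", full_stats)
print("merged subset:  ", merged_stats)
```

With the real spreadsheets loaded, the same comparison shows whether the 354 merged counties look like the full 1370 or are a noticeably different subset.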

Histograms: When we check the histogram of the individual inactivity data, it is apparent that the data is negatively skewed. However, when we check the inactivity data from the dataframe that contains the diabetes, inactivity and obesity percentages, we see that the inactivity data, although slightly skewed, is closer to a normal curve.
Residuals: When we compare the residual plots for both dataframes, the plot for the combined dataframe has more points closer to the line than the plot for the individual dataframe; however, it also shows more heteroscedasticity.
<- image for individual inactive dataframe
<- image for combined inactive dataframe
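The two observations above can also be checked with numbers instead of only by eye. The sketch below computes a moment-based skewness (negative values mean a longer left tail), the residuals of a least-squares line, and a very crude heteroscedasticity check that compares the residual spread in the lower and upper half of the x values. The function names and the toy data are my own for illustration, and the half-split check is a rough stand-in, not a formal test.

```python
from statistics import mean, pstdev


def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (negative => longer left tail)."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m3 = mean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(x, y):
    """Observed y minus fitted y for each point."""
    slope, intercept = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def spread_by_half(x, res):
    """Crude check: residual standard deviation for low-x vs high-x points."""
    ordered = [r for _, r in sorted(zip(x, res))]
    half = len(ordered) // 2
    return pstdev(ordered[:half]), pstdev(ordered[half:])


# Toy data only (assumed for illustration): scatter grows as x grows.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 2.1, 2.9, 4.2, 4.5, 6.8, 6.0, 9.0]

res = residuals(x, y)
low_spread, high_spread = spread_by_half(x, res)

print("skewness of y:", skewness(y))
print("residual spread (low x, high x):", low_spread, high_spread)
```

If the second spread is much larger than the first, the residuals fan out as x increases, which is the pattern described as heteroscedasticity in the residual plots.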

Owing to these differences, we have decided to calculate the R^2 value for both dataframes and compare them to see which dataframe is the better alternative.
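For reference, here is a small sketch of how R^2 can be computed for a simple linear fit, using R^2 = 1 - SS_res / SS_tot. The function name `r_squared` and the toy numbers are assumptions for illustration; in the project, x would be the inactivity percentage and y the diabetes percentage from each dataframe.

```python
from statistics import mean


def r_squared(x, y):
    """R^2 of a least-squares line fitted to (x, y): 1 - SS_res / SS_tot."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx

    # Residual sum of squares: variation left after fitting the line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: variation of y around its mean.
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy data only, not the project's values.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y_noisy = [1.0, 2.1, 2.9, 4.2, 4.5, 6.8, 6.0, 9.0]
y_exact = [2 * xi + 1 for xi in x]

r2_noisy = r_squared(x, y_noisy)
r2_exact = r_squared(x, y_exact)

print("R^2 (noisy):", r2_noisy)
print("R^2 (exact line):", r2_exact)
```

One thing to keep in mind is that the two R^2 values are computed on different sets of counties, so the comparison is indicative rather than a strict like-for-like test.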