[Project 1] Day 2: Continuing Data Exploration

Today my group mates and I decided to put the obesity-diabetes work on hold and start with the inactivity-diabetes data.

  • The first thing that concerned me was whether to use the inactivity data from the spreadsheet merged with the obesity data or the inactivity spreadsheet on its own. The reason is that when all 3 spreadsheets are combined, the number of data points comes down to 354, whereas the inactivity spreadsheet alone has 1370 data points.
  • So I decided to compare the descriptive statistics of the merged dataframe and the individual dataframe. The comparison of the 2 sets of visualizations is as follows:
  1. Histograms: In the histogram of the individual inactivity data, it is apparent that the data is negatively skewed. However, in the dataframe that contains the diabetic, inactive and obese percentage data, the inactivity data, although slightly skewed, is closer to a normal curve.
  2. Residuals: When we compare the residual plots for both dataframes, it is observed that, due to more data points, the plot for the combined dataframe has more points closer to the line than the plot for the individual dataframe. However, it shows more heteroscedasticity as well.
     <- image for individual inactive dataframe
     <- image for combined inactive dataframe
  • Owing to these differences, we have decided to calculate the R^2 value for both dataframes and compare them to see which dataframe is the better alternative.
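
The comparison described above (skewness of the inactivity values, residuals from a straight-line fit, and R^2) can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the code we actually ran: the function names and the toy numbers are made up for the example, and the fit is assumed to be a simple least-squares line of diabetic percentage against inactivity percentage. With pandas dataframes, `describe()` and `skew()` would give the descriptive statistics directly.

```python
from statistics import fmean


def skewness(values):
    """Population skewness (third standardized moment).

    A negative result means a longer left tail (negatively skewed).
    """
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x, y, slope, intercept):
    """Observed minus fitted values; these are what a residual plot shows."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


def r_squared(x, y):
    """Share of the variance in y explained by the fitted line."""
    slope, intercept = fit_line(x, y)
    res = residuals(x, y, slope, intercept)
    mean_y = fmean(y)
    ss_res = sum(r ** 2 for r in res)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical toy values for illustration only, NOT the project data.
inactive_pct = [12.0, 14.5, 15.0, 16.2, 17.1, 17.4]
diabetic_pct = [6.1, 7.0, 7.4, 7.9, 8.6, 8.5]

print("skewness of inactivity:", round(skewness(inactive_pct), 3))
print("R^2 of the line fit:", round(r_squared(inactive_pct, diabetic_pct), 3))
```

Running the same functions once on the merged dataframe columns and once on the individual inactivity dataframe would give two skewness values and two R^2 values to put side by side.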

 

 
