[Project 1] Day 1: Exploring the CDC Data

Today my project group and I started exploring the CDC data.

  • The Excel file contains 3 sheets: diabetes, obesity and inactivity. The sheet titles themselves hint at individual and combined relationships among the 3 groups of data.
  • At first glance, there is far more data on the percentage of diabetes (about 3k rows) than on either obesity (a little over 300 rows) or inactivity (a bit over 1k rows). Merging the dataframes on the FIPS column confirmed this.
  • In class today, we discussed the relationship between inactivity and diabetes, so as a group we decided to explore the relationship between diabetes and obesity.
  • In addition to calculating the descriptive statistics (mean, median, etc.), I visualized the data with histograms, box plots and Q-Q plots.
  • Additionally, in class today I was introduced to the concept of ‘heteroscedasticity’. The scatter plot clearly showed that the data is unevenly scattered. The data will have to be ‘transformed’ (for lack of a better word) to bring the distribution closer to normal.
  • The obesity data was negatively skewed (-2.69210) and the diabetes data had a skewness of 0.089145. Although the calculations look correct, I feel there is a mistake in my work and will go through it again to check for errors.
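
The merge and skewness steps described above can be sketched roughly as follows. This is only an illustration, not the notebook I actually ran: the two small dataframes are made-up toy values, and the column names `pct_diabetic` and `pct_obese` are placeholders I am assuming here (only `FIPS` comes from the real file).

```python
import pandas as pd

# Toy stand-ins for two of the sheets. The values and the column names
# "pct_diabetic" and "pct_obese" are invented for illustration only.
diabetes = pd.DataFrame({
    "FIPS": [1001, 1003, 1005, 1007, 1009],
    "pct_diabetic": [8.1, 9.4, 7.6, 10.2, 8.8],
})
obesity = pd.DataFrame({
    "FIPS": [1003, 1005, 1009, 1011],
    "pct_obese": [18.2, 17.5, 16.9, 18.0],
})

# An inner merge keeps only the counties present in both sheets, so the
# merged frame cannot have more rows than the smaller sheet
# (assuming FIPS is unique within each sheet).
merged = diabetes.merge(obesity, on="FIPS", how="inner")

row_counts = {
    "diabetes": len(diabetes),
    "obesity": len(obesity),
    "merged": len(merged),
}
print(row_counts)

# Descriptive statistics (count, mean, std, quartiles) and sample
# skewness for each percentage column of the merged data.
columns = ["pct_diabetic", "pct_obese"]
summary = merged[columns].describe()
skewness = merged[columns].skew()
print(summary)
print(skewness)
```

One thing worth keeping in mind when re-checking the skewness numbers: pandas' `skew()` returns a bias-corrected estimate, while `scipy.stats.skew` is biased by default (`bias=True`), so the two can disagree slightly on the same column.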
