[Project1] Day 9: Handling Missing Data

Today while working on the CDC data I faced the problem of handling missing data.

  • There are many reasons in general why certain values of data might be missing; the 2 that I believe are most likely here are:
    1. Past data might have been corrupted due to improper maintenance
    2. The data was simply not recorded (human error, etc.)
  • In the case of our data, the diabetes sheet has about 3k rows, but the obesity data has a little over 300 rows and the inactivity data a bit over 1k rows.
  • Due to this, on merging all three data sets I had a little over 300 rows of data. But when I checked the merged data, there were still 9 null values in the % inactive column.
  • Now, if the data were vast, I would have considered dropping the rows with null values. However, since we only have around 300 rows, I was not keen on reducing the data further.
  • On researching, I found that the main method for handling these missing values is imputing them. There are 4 common ways of doing this.
  • Imputing an arbitrary value involves replacing missing values with a specified (arbitrary) value like -3, 0, 7, etc. This method has a number of disadvantages: the arbitrary value could inject bias into the dataset if it is not indicative of the ‘underlying data’, and it could also limit the variability of the data and make it difficult to find patterns.
  • Replacing with the mean involves, as the name suggests, imputing the missing values with the average (mean) of the column. The main scenario where this is not appropriate is when there are outliers. However, since we had already treated the outliers, I used this method to fill in the missing values and continue with my analysis of the data (a short sketch of this follows after this list).
  • The other 2 ways of handling the missing data would be Replacing with the median and Replacing with the mode.
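
Below is a minimal sketch of what the mean imputation looks like in pandas. The file name and column name here (merged_cdc_data.csv, '% INACTIVE') are placeholders I am using for illustration, not necessarily the exact names in my notebook.

```python
import pandas as pd

# Placeholder file/column names standing in for the merged CDC dataframe.
merged_df = pd.read_csv("merged_cdc_data.csv")

print(merged_df["% INACTIVE"].isna().sum())     # expect the 9 nulls before imputing

# Mean of the column (pandas skips NaN by default), used to fill the gaps.
mean_inactive = merged_df["% INACTIVE"].mean()
merged_df["% INACTIVE"] = merged_df["% INACTIVE"].fillna(mean_inactive)

print(merged_df["% INACTIVE"].isna().sum())     # 0 after imputing
```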

[Project1] Day 8: Conducting Cross Validation and Bootstrap

Today I went a little more in depth into how to do Cross Validation and into understanding Bootstrap.

  • To start cross validation, the data first has to be divided into 2 parts, i.e. the training data and the testing data.
  • To do this, we first need to decide how many subsets or folds (k) we will split the total data into; the training data would be k-1 folds and the testing data would be the remaining 1 fold.
  • Next, we need to select a performance metric. Performance metrics quantify how well the model predicts, which helps us evaluate it.
  • We would then repeat this process k times, once for each fold, and take the average of the performance metrics as the estimate of the model’s overall performance.
  • By conducting Cross Validation we can estimate the test error.
  • From my understanding, Cross Validation is random sampling without replacement, whereas in Bootstrap there is replacement.
  • In cross validation we evaluate the model over a number of different training/test splits, which is useful when we have a large amount of data. Bootstrap is better when we have a small amount of data: by repeatedly resampling (with replacement) from the observed, limited data, we can estimate the distribution of a statistic.
  • Coming back to the project: now that I have a better understanding of the concepts, I believe that bootstrapping would better serve our model built on the CDC data, as it has only 300-odd data points (a sketch of both approaches follows after this list).
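
As a rough sketch of both ideas, here is what they look like with scikit-learn and NumPy on made-up stand-in data (the real column names and dataframe are not reproduced here): k-fold cross validation averages a performance metric over the folds, while the bootstrap resamples the rows with replacement to estimate the distribution of a statistic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the ~300-row merged CDC data:
# X plays the role of % inactive / % obese, y the role of % diabetic.
X = rng.normal(size=(300, 2))
y = X @ np.array([0.5, 0.3]) + rng.normal(scale=0.1, size=300)

# k-fold cross validation: split into k=5 folds, fit on k-1 folds, test on the
# held-out fold, repeat for every fold, then average the metric (R^2 here).
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("Cross-validated R^2 estimate:", cv_scores.mean())

# Bootstrap: resample rows WITH replacement many times, refit each time, and
# use the spread of the refitted slopes as an estimate of their distribution.
boot_slopes = []
for _ in range(1000):
    idx = rng.integers(0, len(y), size=len(y))   # indices sampled with replacement
    fit = LinearRegression().fit(X[idx], y[idx])
    boot_slopes.append(fit.coef_[0])
print("Bootstrap slope mean / std. error:", np.mean(boot_slopes), np.std(boot_slopes))
```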

[Project 1] Day 7: Intro to Resampling Methods

Today I started learning about resampling techniques, in particular Cross Validation.

  • What is Resampling? From what I understood, since sampling is the process of data collection, resampling is the process of conducting repeated tests on the same sample, or creating new samples on the basis of the 1st observed sample.
  • Why do we use Resampling? When we create a prediction model on some data, it is always good to test it on new data. But since we may not always have new data, we can use resampling methods to generate it from the data we already have.
  • The main use of Cross Validation is checking our prediction model for test error caused by overfitting.
    * Test Error is the mean (avg.) error that comes from testing on new data, while Training Error is the error computed on the data the model was trained on.
    * What is Overfitting? When we conduct our regression analysis, if we ‘fit’ the line extremely close to certain data points, the model is said to be overfit. This results in the model fitting only this initial data and not giving good predictions for other data.
  • In cross-validation we divide the data into 2 parts: the training data and the validation data. In simple words, the training data is used to ‘train’ or fit the model, and this fitted model is then used to try and predict the outcomes in the validation data (a sketch of this follows after this list).
  • With relation to the current project on the CDC data, I am still considering whether I want to use this approach or the bootstrap.
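
To make the training-error vs test-error idea concrete, here is a small sketch on synthetic data (not the CDC data): a deliberately over-flexible model gets a lower training error but a worse validation error than a simple one, which is exactly what the held-out validation set is there to reveal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Synthetic stand-in data: one predictor, one response with noise.
x = rng.uniform(0, 1, size=100).reshape(-1, 1)
y = 2.0 * x.ravel() + rng.normal(scale=0.2, size=100)

# Hold out a validation set; the model is fit only on the training part.
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=1)

for degree in (1, 15):  # degree 15 is deliberately over-flexible
    features = PolynomialFeatures(degree)
    model = LinearRegression().fit(features.fit_transform(x_train), y_train)
    train_err = mean_squared_error(y_train, model.predict(features.transform(x_train)))
    val_err = mean_squared_error(y_val, model.predict(features.transform(x_val)))
    print(f"degree {degree}: training error {train_err:.3f}, validation error {val_err:.3f}")
```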

[Project 1] Day 6: Understanding Linear Regression

  • In statistical analysis, ‘Regression’ is a method used to understand or determine the relationship between 2 or more variables. Using this relationship, we can estimate an unknown value that depends on our predicting variable.
    * A variable in statistics represents any quantity that can be measured or counted.
    There are 2 main types of variables:
    a) Categorical variables: variables which represent groupings
    b) Quantitative variables: variables which use numbers to represent measured or counted amounts.
    In the case of the CDC data we have 3 quantitative variables, namely % Diabetes, % Inactivity and % Obesity. The categorical variables are ‘STATE’ and ‘COUNTY’.
  • In regression, we classify the variables we want to analyze as the ‘dependent variable’ and the ‘independent variable’.
  • As the name suggests, ‘Linear Regression’ assumes that there is a linear relationship between the variables. The end result would have us plot a straight line through the data points on a plot which would best describe the relationship.
  • Simple Linear Regression fits a straight 2-dimensional line to find the relationship between 2 variables.
    In our project we would be doing simple linear regression to show the relationships between % Diabetes & % Inactivity, % Diabetes & % Obesity, and % Inactivity & % Obesity respectively. This line would be represented by the equation y = β₀ + β₁x₁ + ε (a sketch of fitting this follows after this list).
  • Multiple Linear Regression, on the other hand, uses two or more predictors at once; with the data of % Diabetes, % Inactivity and % Obesity this corresponds to fitting a plane in a 3-dimensional plot.
  • Project work: On meeting with the group today, we decided to explore our options after having calculated the R^2 value for the data. We have begun the ‘feature engineering’ process. On my part, I started feature scaling and am exploring the statistical tests which will be helpful for the project.
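
For reference, a minimal sketch of one of these simple linear regressions using statsmodels; the file and column names ('% INACTIVE', '% DIABETIC') are placeholders for whatever the merged dataframe actually uses.

```python
import pandas as pd
import statsmodels.api as sm

# Placeholder file/column names for the merged CDC dataframe.
df = pd.read_csv("merged_cdc_data.csv")

# Simple linear regression: % Diabetes = beta0 + beta1 * % Inactivity + error
X = sm.add_constant(df["% INACTIVE"])       # adds the intercept term beta0
model = sm.OLS(df["% DIABETIC"], X).fit()

print(model.params)      # beta0 (const) and beta1 (slope)
print(model.rsquared)    # R^2 of the fit
```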

Analyzing Crab Molt Data

In today’s class, we discussed the analysis of the ‘crab molt’ data.

  • Crab molting is the action of the crab breaking its outer skeleton, or exoskeleton, in order to grow. Our objective was to try and predict the size of the crab before molting using the post-molt size of the crabs.
  • To begin with, we fitted a linear model and found that the value of R^2 was 0.98.
  • The descriptive statistics of the post-molt data showed a skewness of -2.3469 and a kurtosis of 13.116, while those of the pre-molt data showed a skewness of -2.00349 and a kurtosis of 9.76632.
  • Looking at the graphs of the normal curve of both pre and post molt data, it appeared that they were similar with just a shift in mean.
  • To check whether this was true we conducted a T-test.
    * A T-test is a statistical test used to compare the mean values of two groups. To do this we set up two hypotheses:
    H0: there is no significant difference between the means
    H1: there is a significant difference between the means
  • The test we conducted on the crab molt data indicated that our null hypothesis (H0) should be rejected; there is a significant difference between the means of the pre-molt and post-molt data. We did this by carrying out a ‘Monte Carlo’ procedure to estimate the p-value for the observed difference of means (a sketch of this procedure follows below).
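
Here is a sketch of the Monte Carlo (permutation) idea, using made-up arrays in place of the actual pre-molt and post-molt measurements: shuffle the pooled data many times, recompute the difference of means each time, and see how often a shuffled difference is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholders for the actual crab measurements.
pre_molt = rng.normal(129, 10, size=400)
post_molt = rng.normal(143, 10, size=400)

observed = post_molt.mean() - pre_molt.mean()

# Under H0 the group labels are exchangeable, so shuffle the pooled data
# many times and record the difference of means each time.
pooled = np.concatenate([pre_molt, post_molt])
diffs = []
for _ in range(10_000):
    rng.shuffle(pooled)
    diffs.append(pooled[: len(pre_molt)].mean() - pooled[len(pre_molt):].mean())

# p-value: fraction of shuffled differences at least as extreme as the observed one.
p_value = np.mean(np.abs(diffs) >= abs(observed))
print(f"observed difference = {observed:.2f}, estimated p-value = {p_value:.4f}")
```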

[Project 1] Day 4: Diabetes, Inactivity, Obesity Distribution by State

Today I tried to see if there was any relation between the distribution of the average % diabetes, % inactivity and % obesity and the state.
1. The bar plot of % diabetes showed that the states of South Carolina, Alabama, Mississippi, Delaware, Florida and Maryland have the highest average % of diabetics (>=10%).


2. The bar plot of % inactivity showed that the states of Alabama, Kentucky, Nebraska, New York and Oklahoma have the highest average % of inactivity (>=17%)


3. The bar plot of % obese showed that, barring 10 states, all the others have a high average % of obesity (>18%).


4. While plotting the correlations, there seems to be no correlation between ‘State’ and % Diabetic, % Inactivity or % Obesity. However, there are a couple of states, like Alabama and Kentucky, that have correspondingly high or low values of each, which would indicate that there could be a correlation. One thing to note is that this correlation is computed on only 300 or so data points; there are many counties whose data we do not have, and this could account for the negative correlation.
5. Another thing to note is that there is a positive correlation between ‘%Diabetic-%Inactivity’, ‘%Diabetic-%Obesity’ and ‘%Inactivity-%Obesity’ of 0.7032549, 0.5269106 and 0.590831 respectively (a sketch of this computation follows below).
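
A sketch of how the per-state averages and the pairwise correlations can be computed with pandas; the column names ('STATE', '% DIABETIC', '% INACTIVE', '% OBESE') are placeholders for the actual headers in the merged dataframe.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder file/column names for the merged CDC dataframe.
df = pd.read_csv("merged_cdc_data.csv")

# Average % diabetic / % inactive / % obese by state, used for the bar plots.
state_means = df.groupby("STATE")[["% DIABETIC", "% INACTIVE", "% OBESE"]].mean()
state_means.plot(kind="bar", figsize=(14, 5))
plt.show()

# Pairwise correlations between the three percentage columns.
print(df[["% DIABETIC", "% INACTIVE", "% OBESE"]].corr())
```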

[Project1] Day 3: Continuing Data Exploration: Understanding Concepts

In my previous blog, I mentioned the term ‘heteroscedasticity’. However, I didn’t explain its relevance to the project, so I decided to revisit the concept.

  • While plotting the linear regression for diabetes and inactivity, it was observed that the percentage of diabetes attributable to inactivity was 19.51%. On checking the residuals afterwards, the following plot was observed (noted in the last blog).

    While there is a significant amount of data near the 0-line, in the range between -0.5 and 0.5, there is a larger amount of data further from the line.
  • Heteroscedasticity is the ‘fanning out’ of the residuals, i.e. the spread of the residuals increases along the line. Heteroscedasticity makes the regression less reliable. The common methods to identify heteroscedasticity are visual inspection of plots and the Breusch-Pagan test (a sketch of the test follows after this list).
  • When comparing the heteroscedasticity of the residual plot for the combined obesity-inactivity-diabetes dataframe with that of the individual inactivity-diabetes dataframe, the spread of the outliers in the combined dataframe is smaller than in the individual dataframe. (combined data residual plot)
    This suggests that the combined dataframe is more reliable than the individual data.
  • So, coming back to Pearson’s R-squared, it should confirm whether the combined data frame is more reliable than the limited data frame.
  • To get started with calculating R^2, I have calculated the correlation between the data. I will go further into this in the next blog.
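
For completeness, a sketch of running the Breusch-Pagan test on the inactivity-diabetes regression with statsmodels; the file and column names are placeholders, not the exact ones in my notebook.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Placeholder file/column names for the merged CDC dataframe.
df = pd.read_csv("merged_cdc_data.csv")

X = sm.add_constant(df["% INACTIVE"])
model = sm.OLS(df["% DIABETIC"], X).fit()

# Breusch-Pagan tests whether the residual variance depends on the regressors;
# a small p-value suggests heteroscedasticity.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan LM p-value:", lm_pvalue)
```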

[Project 1] Day 2: Continuing Data Exploration

Today, along with my group mates, we decided to put the obesity-diabetes work on hold and start with the inactivity-diabetes data.

  • The first thing that concerned me was whether to use the inactivity data from the spreadsheet combined with the obesity data or to use only the inactivity spreadsheet. The reason for this is that when combining all 3 spreadsheets, the number of data points comes down to 354, whereas when only the inactivity spreadsheet is used there are 1370 data points.
  • So I decided to compare the descriptive statistics of both the merged dataframe and the individual dataframe. The comparisons of the 2 visualizations are as follows:
  1. Histograms: When we check the individual inactivity data histogram, it is apparent that the data is negatively skewed. However, when we check the inactivity data from the dataframe which contains the diabetic, inactive and obese percentage data, we see that the inactive data, although slightly skewed, is closer to a normal curve.
  2. Residuals: When we compare the residual plots for both data frames, it is observed that, owing to its larger number of data points, the plot of the individual data frame has more points close to the line than the plot of the combined dataframe; however, it shows more heteroscedasticity as well.
    (image: residual plot for the individual inactivity dataframe)
    (image: residual plot for the combined dataframe)
  • Owing to these differences, we have decided to calculate the R^2 value for both and compare them to see which dataframe is the better alternative (a sketch of this comparison follows below).
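
A sketch of that comparison: fit the same simple regression on both dataframes and compare the R^2 values. The file names and column names here are placeholders for however the two dataframes are actually stored.

```python
import pandas as pd
import statsmodels.api as sm

def r_squared(df):
    """Fit % DIABETIC ~ % INACTIVE and return the R^2 of the fit."""
    X = sm.add_constant(df["% INACTIVE"])
    return sm.OLS(df["% DIABETIC"], X).fit().rsquared

# Placeholder file names: one dataframe built from the inactivity sheet alone
# (merged with the diabetes data only), one from merging all three sheets.
individual_df = pd.read_csv("inactivity_diabetes.csv")
combined_df = pd.read_csv("all_three_merged.csv")

print("R^2 (individual, ~1370 rows):", r_squared(individual_df))
print("R^2 (combined, ~354 rows):", r_squared(combined_df))
```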

[Project 1] Day 1: Exploring the CDC Data

Today my project group and I started our exploration of the CDC data.

  • 3 sheets are given in the Excel file: diabetes, obesity and inactivity. The titles of the sheets themselves hint at there being individual and combined relationships between the 3 groups of data.
  • At first glance it is obvious that the amount of data on the percentage of diabetes (about 3k rows) is far greater than that of both obesity (a little over 300 rows) and inactivity (a bit over 1k rows). This was confirmed by merging the dataframes on the FIPS column (a sketch of the merge follows after this list).
  • In class today, we discussed the relationship between inactivity and diabetes, so as a group we decided to explore the relationship between diabetes and obesity.
  • In addition to calculating the descriptive statistics (mean, median, etc.), I visualized the data with the help of histograms, box plots and QQ plots.
  • Additionally, in class today I was introduced to the concept of ‘Heteroscedasticity’. From the scatter plot it was clearly visible that the data is unevenly scattered. The data will have to be ‘transformed’ (for lack of a better word) to bring the distribution closer to normal.
  • The obesity data was negatively skewed (-2.69210) and the diabetic data had a skewness of 0.089145. Although the calculations look correct, I feel there is a mistake in my work and will go through it again to check for errors.
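
Finally, a sketch of the merge and the descriptive statistics with pandas; the workbook name, sheet names and column names are placeholders, since I am not reproducing the exact file here.

```python
import pandas as pd

# Placeholder file and sheet names for the CDC Excel workbook.
diabetes = pd.read_excel("cdc_data.xlsx", sheet_name="Diabetes")
obesity = pd.read_excel("cdc_data.xlsx", sheet_name="Obesity")
inactivity = pd.read_excel("cdc_data.xlsx", sheet_name="Inactivity")

# Inner-merge on the FIPS county code: only counties present in all three
# sheets survive, which is why the row count drops to roughly 350.
merged = diabetes.merge(obesity, on="FIPS").merge(inactivity, on="FIPS")
print(len(merged))

# Descriptive statistics and skewness for the percentage columns (names assumed).
cols = ["% DIABETIC", "% OBESE", "% INACTIVE"]
print(merged[cols].describe())
print(merged[cols].skew())
```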