[Project1] Day 9: Handling Missing Data

Today, while working on the CDC data, I faced the problem of handling missing data.

  • There are many reasons why values can be missing from a dataset; the two I believe are most likely here are:
    1. Past data may have been corrupted due to improper maintenance.
    2. The data was simply never recorded (human error, etc.).
  • In the case of our data, the diabetes sheet has about 3k rows, but the obesity data has a little over 300 rows and the inactivity data a bit over 1k rows.
  • Because of this, merging all three datasets left me with a little over 300 rows of data. But when I checked the merged data, there were still 9 null values in the % inactive column (see the first sketch after this list).
  • Now, if the dataset were vast, I would have considered dropping the rows with null values. However, since we only have around 300 rows, I was not keen on shrinking the data further.
  • On researching, the main approach I found for handling these missing values was imputing them. There are four common ways of doing this.
  • Imputing an arbitrary value involves replacing missing values with a specified (arbitrary) value like -3, 0, 7, etc. This method has a number of disadvantages: the arbitrary value can inject bias into the dataset if it is not indicative of the underlying data, and it can also limit the variability of the data and make it harder to find patterns.
  • Replacing with the mean involves, as the name suggests, imputing the missing values with the average of the column. The main scenario where this is not appropriate is when there are outliers, since they pull the mean away from the typical value. Since we had already treated the outliers, I used this method to fill in the missing values and continue with my analysis (see the second sketch below).
  • The other two ways of handling missing data are replacing with the median and replacing with the mode.
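
For reference, here is a minimal sketch of the merge-and-check step. It assumes the three sheets share a county FIPS column; the file and column names are illustrative, not necessarily the exact ones in the CDC download:

```python
import pandas as pd

# Illustrative file names; the actual CDC sheets may be named differently.
diabetes = pd.read_csv("diabetes.csv")      # ~3k rows
obesity = pd.read_csv("obesity.csv")        # ~300 rows
inactivity = pd.read_csv("inactivity.csv")  # ~1k rows

# Inner-join on a shared county identifier (assumed here to be "FIPS"),
# which keeps only counties present in all three sheets (~300 rows).
merged = diabetes.merge(obesity, on="FIPS").merge(inactivity, on="FIPS")

# Count the remaining nulls per column; the inactivity column showed 9 here.
print(merged.isnull().sum())
```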
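
And a sketch of the four imputation options using pandas' fillna. The series below is just a toy stand-in for the % inactive column, with a couple of nulls to fill:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the "% INACTIVE" column of the merged data,
# with missing entries like the 9 nulls I found.
col = pd.Series([16.2, 18.5, np.nan, 21.0, np.nan, 17.4], name="% INACTIVE")

# 1. Arbitrary value: simple, but can bias the column and flatten its variance.
arbitrary_filled = col.fillna(-1)

# 2. Mean: what I went with, since the outliers were already treated.
mean_filled = col.fillna(col.mean())

# 3. Median: the more robust choice if outliers were still present.
median_filled = col.fillna(col.median())

# 4. Mode: usually reserved for categorical columns.
mode_filled = col.fillna(col.mode()[0])

print(mean_filled)
```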
