[Project1] Day 8: Conducting Cross Validation and Bootstrap

Today I went a little more in-depth into how to conduct Cross Validation and into understanding the Bootstrap.

  • To start cross validation, the data first has to be divided into 2 parts, i.e. the training data and the testing data.
  • To do this we need to decide into how many subsets or folds (k) we will split the total data; the training data would then be k-1 folds and the testing data would be the remaining 1 fold.
  • Next we need to select a performance metric, such as accuracy or mean squared error, which we use to evaluate how well the model performs on the held-out fold.
  • We would then repeat this process k times, so that each fold serves as the test data once, and take the average of the performance metric, which is the estimate of the model's overall performance. (A short sketch of this loop appears after this list.)
  • By conducting Cross Validation we obtain an estimate of the test error.
  • From my understanding, Cross Validation uses random sampling without replacement, whereas the Bootstrap samples with replacement.
  • In cross validation we train on several different training folds but test each model on a single held-out fold, which works well when we have a large amount of data. The Bootstrap is better when we have less data: by repeatedly resampling with replacement from the observed (limited) data, we can estimate the sampling distribution of a statistic. (A Bootstrap sketch also follows this list.)
  • Coming back to the project, now that I have a better understanding of the concepts, I believe that Bootstrapping would better serve our model built on the CDC data, as it has only 300-odd data points.
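
Here is a minimal sketch of the k-fold loop described above. The dataset, the logistic regression model, and accuracy as the metric are all placeholder assumptions for illustration, not choices from the project itself.

```python
# Minimal k-fold cross validation sketch (placeholder data and model).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                       # stand-in feature matrix (~300 points)
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)  # stand-in labels

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    # k-1 folds form the training data, the remaining fold is the test data
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    scores.append(accuracy_score(y[test_idx], preds))

# the average over the k folds estimates the model's overall performance
print(f"estimated test accuracy: {np.mean(scores):.3f}")
```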
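
And a minimal sketch of the Bootstrap, again on stand-in data rather than the actual CDC data, here estimating the sampling distribution of the mean:

```python
# Minimal bootstrap sketch: resample with replacement from a small sample
# to approximate the sampling distribution of a statistic (the mean).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=300)   # stand-in for ~300 observations

n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # draw a resample of the same size as the original data, with replacement
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = resample.mean()

# the spread of the bootstrap means approximates the standard error,
# and percentiles give an approximate confidence interval
print(f"bootstrap SE of the mean: {boot_means.std(ddof=1):.3f}")
print(f"95% CI: {np.percentile(boot_means, [2.5, 97.5])}")
```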
