Today I went a little more in-depth into how to do Cross-Validation and into understanding the Bootstrap.
- To start cross-validation, the data first has to be divided into 2 parts, i.e. the training data and the testing data.
- To do this we first need to decide how many subsets or folds (k) we will split the total data into; the training data would then be k-1 folds and the testing data would be the remaining 1 fold.
- Next we need to select a performance metric. Performance metrics (e.g. accuracy for classification, mean squared error for regression) quantify how well the model's predictions match the held-out fold, which is how we evaluate our model.
- We would then repeat this process k times, holding out a different fold each time, and take the average of the performance metric across the folds; that average is the estimate of the model's overall performance (see the cross-validation sketch after this list).
- By conducting Cross-Validation we obtain an estimate of the test error.
- From my understanding, Cross-Validation samples the data without replacement, whereas the Bootstrap samples with replacement.
- In cross-validation we train on a number of folds but test on only 1 fold at a time, which works well when we have a large amount of data. The Bootstrap is better when we have less data: by repeatedly resampling with replacement from the observed (limited) data, we can approximate the sampling distribution of a statistic (see the bootstrap sketch after this list).
- Coming back to the project, now that I have a better understanding of the concepts, I believe Bootstrapping would better serve our model built on the CDC data, as it has only 300-odd data points.
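
Here is a minimal sketch of the k-fold procedure described above, assuming scikit-learn is available; the linear model, the mean-squared-error metric, and the randomly generated data are stand-ins I made up for illustration, not our actual pipeline.

```python
# A minimal k-fold cross-validation sketch (toy data, hypothetical model/metric).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                                    # stand-in features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=300)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, test on the remaining held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    fold_scores.append(mean_squared_error(y[test_idx], preds))

# The average over all k folds is the cross-validation estimate of test error.
cv_estimate = np.mean(fold_scores)
print(f"{k}-fold CV estimate of test MSE: {cv_estimate:.4f}")
```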
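
And a minimal bootstrap sketch, assuming only NumPy; the generated sample of 300 points stands in for the real CDC data, and the statistic (the mean) is just an example of what we might want an interval for.

```python
# A minimal percentile-bootstrap sketch (toy sample, hypothetical statistic).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=300)   # stand-in for the ~300 CDC points

n_boot = 10_000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Resample WITH replacement, same size as the original sample.
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = resample.mean()

# The spread of the resampled statistics approximates the sampling
# distribution of the mean, giving a 95% percentile confidence interval.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {data.mean():.3f}")
print(f"95% bootstrap CI for the mean: ({lo:.3f}, {hi:.3f})")
```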