Why do we need cross-validation?

asheesh kumar singhal
2 min read · Aug 31, 2020

Cross-validation is an important concept in the machine learning world. It helps to check how your algorithm will perform when it is given unseen data.

Consider a dataset with ’n’ data points. Let us first split it into two sets: train data and test data. After the split, the machine learning algorithm learns its parameters (hidden parameters/weights) from D-Train (the training dataset), while D-Test (the testing dataset) is used to tune its hyper-parameters (e.g. K in KNN) based on accuracy, error and other results. This means that although D-Test is not directly given to the algorithm to learn from, it is still used, along with the training data, to tune the algorithm in some way.
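The two-way split above can be sketched with scikit-learn (a minimal illustration; the dataset, the test fraction and the candidate K values are arbitrary choices, not from the article). Note how K is picked using D-Test scores, which is exactly the practice the next paragraph criticizes:

```python
# Two-way split: D-Train fits the model's weights, while D-Test is
# (mis)used here to choose the hyper-parameter K for KNN.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_test, y_test)  # D-Test drives the tuning
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, round(best_acc, 3))
```

Because D-Test influenced the choice of K, `best_acc` is an optimistic estimate of how the model would do on genuinely new data.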

Since both D-Train and D-Test are used for learning, no part of the data is truly unseen or hidden from the algorithm, so D-Test cannot be used to validate the results on future incoming query points. The accuracy obtained on this data split cannot be carried over to future data with the same confidence as we obtained on this data.

Hence we split the data into three sets: D-Train, D-Validation and D-Test. We use the train data as usual, and we use D-Validation for hyper-parameter tuning, the role D-Test played in the previous case. This makes D-Test truly unseen data: it behaves like incoming future data and enables us to check our hypothesis. We do not use it in any learning process and do not tune any parameter with it. Thus the results (accuracy and error) we receive on this held-out data can be considered equivalent to what we will get once the model is deployed in a production environment.
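A sketch of the three-way split, again assuming scikit-learn (split fractions and candidate K values are illustrative, not prescribed by the article). D-Test is carved out first and touched only once, at the very end:

```python
# Three-way split: tune K on D-Validation, evaluate once on D-Test.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out D-Test first, so it plays no role in training or tuning.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Split the remainder into D-Train and D-Validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

# Choose K using validation accuracy only.
best_k = max(
    [1, 3, 5, 7, 9],
    key=lambda k: KNeighborsClassifier(n_neighbors=k)
                  .fit(X_train, y_train).score(X_val, y_val))

# A single final evaluation on the unseen D-Test estimates
# how the model is likely to perform in production.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = final.score(X_test, y_test)
print(best_k, round(test_acc, 3))
```

Since D-Test never influenced any parameter or hyper-parameter, `test_acc` is the honest quality check the article describes.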

In this way, splitting the data into train, validation and test sets provides an additional quality check and helps to verify how the model is likely to perform when future data comes.
