A Simple Introduction to Cross-Validation

Ayush Gupta
9 min read · Jan 1, 2024

Introduction

Machine learning is the process of building models that learn from data to make predictions or decisions. These models are not perfect, however, and may perform poorly on new, unseen data. Often this is because they overfit the training data: they memorize its noise and idiosyncrasies but fail to generalize to new data.

To avoid overfitting and ensure that the model generalizes well, we need to evaluate its performance on data it was not trained on. This is where cross-validation comes in.

Cross-validation is a technique that divides the available data into multiple folds (subsets), holds out one fold as a validation set, and trains the model on the remaining folds. The process is repeated so that each fold serves as the validation set exactly once. Finally, the results from each validation step are averaged to produce a more robust estimate of the model’s performance.
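The procedure just described can be sketched directly with scikit-learn's `KFold` splitter. The dataset and estimator below are illustrative assumptions, not choices made in the article:

```python
# Minimal sketch of k-fold cross-validation: split the data into k folds,
# train on k-1 folds, validate on the held-out fold, repeat, then average.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)  # illustrative dataset
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, val_idx in kf.split(X):
    # Train on the remaining folds...
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # ...and evaluate on the fold that was held out.
    scores.append(model.score(X[val_idx], y[val_idx]))

# Averaging the per-fold scores gives a more robust performance estimate.
print(f"Mean accuracy over {len(scores)} folds: {np.mean(scores):.3f}")
```

Each fold's score comes from data the model never saw during that iteration, which is what makes the averaged estimate a better proxy for performance on unseen data than a single train/test split.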

In this article, we will look at what cross-validation is, why it should be used, how to perform it with scikit-learn, when it is appropriate, and how it benefits overall model performance and development.

What is Cross Validation?

Cross-validation is a resampling method that uses different portions of the data to train and test a model on different iterations. It is…
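In practice, scikit-learn (which the article relies on) wraps this entire resampling loop in a single call, `cross_val_score`. A minimal sketch, with an illustrative dataset and estimator of my own choosing:

```python
# cross_val_score performs the split/train/score loop internally and
# returns one score per fold.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)       # illustrative dataset
clf = DecisionTreeClassifier(random_state=0)  # illustrative estimator

scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
print("Per-fold scores:", scores)
print("Mean score:", scores.mean())
```

The `cv=5` argument requests five folds; passing a splitter object such as `KFold` or `StratifiedKFold` instead gives finer control over how the folds are formed.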
