Overfitting

The term «overfitting» in machine learning refers to a problem that arises when a model fits the training data too closely. This reduces its ability to generalise to new data that were not seen during training.

In other words, the model fits the particularities and the noise of the training data set very well, but loses the ability to identify meaningful patterns that apply to previously unseen data. This concept is also known as «overadjustment».

Consequences of overfitting

Overfitted models often exhibit high accuracy on the training data set but poor accuracy on new data, such as a validation set or test set.

Overfitting occurs because the model tries to find rules in the training sample that, in reality, do not exist; instead, it finds structures and patterns in the noise of the training sample.

Some signs that a model may be overfitted are:

  • A large gap in performance metrics between the training and validation datasets. 

  • Poor generalisation of the model when used on previously unseen data. 

  • Excessive complexity in the structure of the model compared to the signal-to-noise ratio of the data. 

The consequences of overfitting can be very negative for the overall performance of a model, as it loses the ability to effectively predict or classify new, previously unseen data. Therefore, detecting and preventing overfitting must be an integral part of the machine learning process.
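The first sign listed above, a gap between training and validation metrics, can be checked directly. The following is a minimal sketch using only the Python standard library; the toy data (`y = 2x` plus noise), the k-nearest-neighbours model and all function names are assumptions made for illustration, not part of the original article.

```python
import random

def make_data(n, rng, noise=0.3):
    """Sample n points from y = 2x + Gaussian noise (a made-up toy problem)."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(0)
train_x, train_y = make_data(40, rng)
val_x, val_y = make_data(40, rng)

# k=1 memorises the training set; k=10 averages over neighbours and is smoother.
for k in (1, 10):
    train_err = mse(train_x, train_y, train_x, train_y, k)
    val_err = mse(train_x, train_y, val_x, val_y, k)
    print(f"k={k:2d}  train MSE={train_err:.3f}  "
          f"validation MSE={val_err:.3f}  gap={val_err - train_err:.3f}")
```

With `k=1` each training point is its own nearest neighbour, so the training error is zero while the validation error is not. That pattern, near-perfect training performance alongside clearly worse validation performance, is the gap to look for; the smoother `k=10` model typically shows a much smaller one.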

How to prevent overfitting?

To prevent overfitting, various strategies can be employed: 

  • Use regularisation techniques: a penalty that depends on the complexity of the model is added to the loss function. This encourages simplicity and reduces the model's ability to overfit the training data. 

  • Increase the size of the dataset: providing the model with more examples in the training set can help minimise overfitting, as the likelihood of the model memorising the particulars of the training set is reduced. 

  • Use cross-validation: the training data set is divided into several subsets (folds); the model is trained on all but one of them and evaluated on the held-out fold, and the process is repeated so that each fold serves once as the evaluation set. In this way, a more reliable estimate of the model's performance on unknown data can be obtained. 

  • Reduce the complexity of the model: simplifying the model structure, such as reducing the number of parameters or the depth of the tree in decision trees, can help reduce the risk of overfitting. 
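Two of the strategies above, regularisation and cross-validation, can be sketched together in a few lines. The example below is illustrative only: it assumes a one-parameter linear model without intercept, an L2 (ridge) penalty with strength `lam`, and made-up data; the helper names are hypothetical.

```python
import random

def ridge_slope(xs, ys, lam):
    """Closed-form slope w of a no-intercept linear model with an L2 penalty:
    it minimises sum((y - w*x)^2) + lam * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k disjoint folds (interleaved; the data is
    assumed to be in random order already)."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_val_mse(xs, ys, lam, k=5):
    """Average held-out MSE: train on k-1 folds, evaluate on the remaining one."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        tr_x = [x for i, x in enumerate(xs) if i not in held_out]
        tr_y = [y for i, y in enumerate(ys) if i not in held_out]
        w = ridge_slope(tr_x, tr_y, lam)
        scores.append(sum((ys[i] - w * xs[i]) ** 2 for i in fold) / len(fold))
    return sum(scores) / k

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(30)]
ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]

# A larger penalty shrinks the slope towards zero; cross-validation shows
# how each choice of lam performs on data the model was not trained on.
for lam in (0.0, 1.0, 10.0):
    print(f"lambda={lam:5.1f}  slope={ridge_slope(xs, ys, lam):.3f}  "
          f"CV MSE={cross_val_mse(xs, ys, lam):.3f}")
```

In practice the penalty strength would be chosen by comparing the cross-validated error across candidate values rather than the training error, since the training error alone always favours the least regularised model.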

 

Bias and variance in overfitting

The concept of overfitting is closely related to the «bias-variance trade-off» in machine learning. Bias and variance are properties of a model that influence its predictive performance: 

  • The bias refers to the error introduced by overly simple assumptions in the model. A model with a high bias oversimplifies the relationship between input and output data, which can result in poor prediction on both the training and test data sets. 

  • The variance refers to the sensitivity of the model to noise in the training data. A model with a high variance captures even noise in the training data set, leading to overfitting. 

It is important to find an optimal balance between bias and variance, as both extremes can be detrimental to model performance. A model with high variance and low bias overfits the data, while a model with high bias and low variance underfits: it does not fit the data well enough.
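The trade-off can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not a method from the original article: the ground-truth function `true_f`, the noise level, and the two toy models (a constant mean predictor and a 1-nearest-neighbour predictor) are all hypothetical. Each model is refitted on many freshly sampled training sets, and the squared bias and the variance of its prediction at a single point `x0` are estimated.

```python
import math
import random

def true_f(x):
    """Hypothetical ground-truth function for the simulation."""
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=30, noise=0.3):
    """Draw a fresh noisy training set from true_f."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def predict_mean(xs, ys, x0):
    """Very simple model: ignore x0 and always predict the average target."""
    return sum(ys) / len(ys)

def predict_1nn(xs, ys, x0):
    """Very flexible model: copy the target of the closest training point."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]

def bias2_and_variance(predict, x0=0.25, repeats=500, seed=0):
    """Refit on many fresh training sets and summarise the predictions at x0."""
    rng = random.Random(seed)
    preds = [predict(*sample_training_set(rng), x0) for _ in range(repeats)]
    mean_pred = sum(preds) / repeats
    bias2 = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / repeats
    return bias2, variance

for name, model in (("mean", predict_mean), ("1-NN", predict_1nn)):
    b2, var = bias2_and_variance(model)
    print(f"{name:5s} bias^2={b2:.3f}  variance={var:.3f}")
```

The constant predictor barely changes from one training set to the next (low variance) but is systematically far from the true value at `x0` (high bias). The 1-nearest-neighbour predictor is close to the truth on average (low bias) but copies the noise of whichever point happens to be nearest, so its predictions vary much more between training sets (high variance).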
