3.2.3 Overfitting and Underfitting
One of the core challenges in machine learning is building models that generalize well to new, unseen data. If a model is too simplistic, it may underfit the data, failing to capture important patterns. Conversely, if a model is too complex, it may overfit the training data, learning noise or spurious details that don’t generalize. Achieving the right balance between underfitting and overfitting is crucial for predictive performance on test data. In this post, we’ll explain what overfitting and underfitting mean, explore their mathematical underpinnings (including training/validation loss curves and the bias–variance decomposition), and illustrate examples ranging from simple polynomial regression to deep neural networks. We’ll also discuss real-world cases in finance, healthcare, and NLP, and outline strategies to mitigate these issues based on evidence and best practices.
What Are Overfitting and Underfitting?
Overfitting occurs when a model learns the training data too well – including noise and outliers – such that it fails to generalize to new data. In other words, the model becomes overly complex and starts to “memorize” quirks of the training set rather than learning the underlying trend. An overfit model often has very low error on training data but much higher error on validation or test data. Formally, an overfitted model may have more parameters than can be justified by the data, extracting spurious patterns from noise. A classic symptom is when the model’s performance on unseen data is significantly worse despite near-perfect performance on training data.
Underfitting, on the other hand, happens when a model is too simple to capture the underlying structure of the data. An underfit model has high errors on both training and test sets because it hasn’t learned the relevant patterns. For example, fitting a straight line (linear model) to data that actually follows a nonlinear curve will underfit – the model cannot represent the curvature, leading to large errors on all data. Underfitting is characterized by high bias (strong assumptions that make the model inflexible) and poor performance even on the training data.
Visualizing underfitting vs. a proper fit vs. overfitting on a regression task. The left panel (linear model) shows high bias underfitting – the model is too simple to follow the data’s curved trend. The middle panel shows a model that fits the data trend well (low bias, low variance). The right panel (very high-degree polynomial) shows high variance overfitting – the model wiggles to pass through every data point, even noise.
In simple terms, underfitting is like a student who didn’t study enough – they haven’t learned the material, so they perform poorly on both practice and real exams. Overfitting is like a student who memorized answers without understanding – they ace the practice (training) tests but struggle with new questions on the real exam. The goal is a model that learns the generalizable concepts (like a student who truly understands the subject) – this model will do well on both training and unseen data.
Mathematical Perspective: Bias–Variance Tradeoff and Learning Curves
Bias–Variance Decomposition and Tradeoff
Underfitting and overfitting can be understood through the bias–variance tradeoff, which quantifies model errors from two sources: bias (error due to overly simple assumptions) and variance (error due to sensitivity to fluctuations in the training data). In an ideal model, both bias and variance are low. However, increasing model complexity tends to decrease bias but increase variance, whereas simplifying a model increases bias but lowers variance. The bias–variance decomposition expresses the expected generalization error as the sum of bias², variance, and irreducible noise:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] \;=\; \mathrm{Bias}\big[\hat{f}(x)\big]^2 \;+\; \mathrm{Var}\big[\hat{f}(x)\big] \;+\; \sigma^2$$

where $\hat{f}(x)$ is the learned model's prediction and $\sigma^2$ is the variance of the irreducible noise. This equation expresses the decomposition of generalization error into three fundamental components:
- Bias²: Error due to overly simplistic assumptions or models that fail to capture the true complexity of the underlying data.
- Variance: Error arising from the model's sensitivity to fluctuations in the training dataset; complex models often have high variance.
- Irreducible Error: The inherent noise in the data that cannot be reduced regardless of model complexity or training methods.
The objective in machine learning is to find the right balance between bias and variance to minimize the overall generalization error.
In this formulation, an underfitted model typically has high bias and relatively low variance (it’s consistently wrong in the same way), whereas an overfitted model has low bias (it fits training data well) but high variance (its performance varies greatly for different data samples). The irreducible error is due to inherent noise in the data and cannot be eliminated. Successful learning involves finding a model complexity that balances bias and variance to minimize the total error.
Conceptual illustration of the bias–variance tradeoff. As model complexity increases (e.g., using more polynomial terms or a more complex model), the training error (yellow curve) steadily decreases, indicating the model is fitting the training data better. The validation (generalization) error (red curve) initially drops but then rises once the model becomes too complex and starts overfitting. The sweet spot (dashed line) is where validation error is minimized – beyond this point, adding complexity only improves the fit to training data (lower bias) at the cost of worse performance on new data (higher variance).
In the left region of the above plot (low complexity), the model cannot capture the data patterns – both training and validation errors are high (underfitting, high bias). In the right region (very high complexity), training error is near zero, but validation error has worsened due to overfitting (high variance). A balanced model lives in between these extremes, achieving low error on training data and similarly low error on validation data. This tradeoff underlies most decisions in model selection and regularization.
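To make the decomposition concrete, the sketch below estimates bias² and variance empirically. It is an illustrative Python example only: the noisy sine function, sample sizes, and polynomial degrees are assumptions chosen for demonstration. The idea is to fit the same model class on many independently drawn training sets, then measure how far the average prediction sits from the true function (bias) and how much individual fits scatter around that average (variance).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

def true_fn(x):
    # Assumed ground-truth function for this synthetic example.
    return np.sin(2 * np.pi * x)

def sample_dataset(n=30, noise=0.2):
    # Draw a fresh noisy training set from the same distribution.
    x = rng.uniform(0, 1, n)
    y = true_fn(x) + rng.normal(scale=noise, size=n)
    return x.reshape(-1, 1), y

x_grid = np.linspace(0, 1, 100).reshape(-1, 1)

for degree in (1, 4, 9):
    # Fit the same model class on many independent training sets and
    # record its predictions on a fixed evaluation grid.
    preds = np.array([
        make_pipeline(PolynomialFeatures(degree), LinearRegression())
        .fit(*sample_dataset())
        .predict(x_grid)
        for _ in range(200)
    ])
    avg_pred = preds.mean(axis=0)
    bias_sq = np.mean((avg_pred - true_fn(x_grid).ravel()) ** 2)  # squared bias
    variance = np.mean(preds.var(axis=0))                         # variance
    print(f"degree {degree}: bias^2 ~ {bias_sq:.3f}, variance ~ {variance:.3f}")
```

Typically, the low-degree fit is dominated by the bias² term while the high-degree fit is dominated by variance, mirroring the tradeoff curve described above.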
Training vs. Validation Loss Curves (Learning Curves)
Another way to diagnose overfitting or underfitting is by examining learning curves – plots of training and validation loss (or error) over training iterations or epochs. In an overfitting scenario, the training loss will keep decreasing (since the model is learning even very fine-grained details of the training set), while the validation loss stops decreasing and eventually starts increasing after a point. This divergence indicates the model begins to memorize the training data and is no longer improving on generalization. For example, the plot below shows training loss steadily dropping, but validation loss reaching a minimum and then rising sharply – a hallmark of overfitting:
Example of training vs. validation loss curves for an overfitting model. The training loss (yellow) continually decreases, approaching zero, whereas the validation loss (orange) initially decreases but then turns upward after epoch ~15. At this point, additional training makes the model perform worse on validation data, signaling that it’s overfitting the training data’s noise.
In contrast, for an underfitting model, both training and validation losses start off high and remain relatively high, with only a small gap between them. The model is not fitting even the training data well, so adding more training epochs or more capacity may continue to reduce both losses. If the training loss stays far from low and sits close to the validation loss, that suggests underfitting (high bias): the model has not adequately learned the data's structure, and increasing model capacity or training longer can help reduce both errors. The ideal learning curve for a well-fit model shows a low training loss and a validation loss that is only slightly higher and has flattened out (converged), indicating the model fits the data without significant overfitting.
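As a rough illustration of how such curves can be produced, the sketch below trains a deliberately over-sized scikit-learn MLPRegressor one pass at a time and records training and validation mean squared error after each pass. The synthetic data, network size, and epoch count are all assumptions for demonstration; the epoch with the lowest validation loss is where early stopping would halt.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic noisy data (an assumption for this sketch).
rng = np.random.RandomState(0)
X = rng.uniform(0, 3, size=(200, 1))
y = np.sin(2 * X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# A deliberately over-parameterized network so that overfitting can appear.
model = MLPRegressor(hidden_layer_sizes=(256, 256),
                     learning_rate_init=0.01, random_state=0)

train_losses, val_losses = [], []
for epoch in range(300):
    model.partial_fit(X_train, y_train)   # one additional pass over the data
    train_losses.append(mean_squared_error(y_train, model.predict(X_train)))
    val_losses.append(mean_squared_error(y_val, model.predict(X_val)))

# If the model overfits, train_losses keeps falling while val_losses bottoms
# out and then rises; the minimum marks the natural early-stopping point.
best_epoch = int(np.argmin(val_losses))
print(f"lowest validation MSE at epoch {best_epoch}: {val_losses[best_epoch]:.4f}")
```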
Examples from Simple Models to Deep Learning
Polynomial Regression Example (Classic ML)
A classic demonstration of underfitting vs. overfitting is polynomial regression on a nonlinear function. Imagine we have data points from a sine wave (which is nonlinear). If we fit a linear model (polynomial of degree 1), it will underfit – a straight line cannot capture the wave’s curvature, yielding large errors. If we increase the degree to, say, 4, the polynomial can approximate the sine wave much better (low training error and still low validation error). However, if we use a very high-degree polynomial, it can wiggle through every data point, even fitting random noise in the training set. Such a degree-9 or 10 polynomial might pass exactly through all training points (zero training error) but oscillate wildly between them, leading to poor predictions on new points (high validation error). The scikit-learn documentation illustrates this: a degree-1 model underfits (fails to fit the samples), degree-4 fits the true curve well, but higher degrees start to overfit the noise.
Polynomial regression example demonstrating underfitting vs. overfitting. Blue dots represent data points from a noisy nonlinear function. The green line is a simple linear fit (underfitting) that cannot capture the curve. The red line is a high-degree polynomial that passes through every point – a perfect fit to training data – but this overly complex curve is likely to perform poorly on new data (overfitting). In this case, a moderately complex model would generalize best.
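This comparison takes only a few lines of scikit-learn. The sketch below is written from scratch in the same spirit as the scikit-learn example referenced above; the synthetic sine data, noise level, and the specific degrees compared are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy samples from a nonlinear (sine) function.
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=50)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

for degree in (1, 4, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree {degree}: train MSE = {train_mse:.3f}, validation MSE = {val_mse:.3f}")

# Expected pattern: degree 1 has high error on both splits (underfitting),
# degree 4 is low on both, and degree 9 has the lowest training error but a
# worse validation error (overfitting).
```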
Another example is decision trees: a tree with very shallow depth might underfit (high bias, since it imposes coarse predictions), whereas an extremely deep tree can overfit by partitioning the data into very narrow segments that capture noise. The optimal depth is somewhere in between, found via validation. In general, whether using linear regression with polynomial features, decision trees, or other algorithms, we see the same pattern: there is an optimal model complexity that minimizes validation error, while models too simple or too complex fare worse.
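A depth sweep of this kind can be run with scikit-learn's validation_curve; the sketch below again uses a synthetic noisy sine curve as an assumed dataset, purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import validation_curve

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=200)

depths = list(range(1, 13))
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    scoring="neg_mean_squared_error", cv=5,
)

# Scores are negated MSE, so flip the sign and average over the 5 folds.
for d, tr, va in zip(depths, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train MSE={tr:.3f}  CV MSE={va:.3f}")
# Shallow trees err everywhere (underfitting); very deep trees push training
# error toward zero while cross-validated error stops improving or worsens.
```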
Overfitting in Deep Learning (CNNs and RNNs)
Modern deep learning models, with millions of parameters, are especially prone to overfitting if not trained carefully. For instance, in computer vision, a convolutional neural network (CNN) can overfit a small image dataset by memorizing distinctive details of each training image (e.g. exact pixel patterns or background noise) instead of learning general features of the object. Such a CNN would achieve near-perfect accuracy on training images but fail to recognize the same objects in new images because it didn’t learn the true distinguishing features. On the flip side, if a CNN architecture is too simple or trained for too few epochs, it might underfit – yielding blurred or inaccurate feature detection and poor accuracy on both training and test sets.
In natural language processing, recurrent neural networks (RNNs) or other language models illustrate these issues as well. An overfitted language model might memorize entire training sentences or context-specific patterns. For example, a sentiment analysis model trained only on movie reviews might overfit to that domain’s terminology and writing style; it will perform poorly when asked to analyze, say, customer product reviews, because it learned patterns very specific to movie reviews. This happens when the model is complex and the training data is limited or narrow – the model picks up idiosyncrasies that don’t generalize. In contrast, an underfit NLP model could be one that uses an overly simple representation (for instance, a bag-of-words model with no context) – it may miss important nuances like word order, sarcasm, or idioms, leading to high errors on all data. In practice, deep learning practitioners monitor training/validation performance closely; techniques like early stopping (discussed later) are used to prevent a network from over-training on the data and starting to overfit.
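In code, these guards usually look something like the Keras sketch below: a small CNN with a dropout layer plus an early-stopping callback that watches validation loss. The architecture, input shape (32×32 RGB), class count, and the x_train/y_train arrays are all hypothetical, chosen only to illustrate the pattern.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A small CNN with dropout; assumed 32x32 RGB inputs and 10 classes.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),                 # randomly drop half the units each step
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop when validation loss has not improved for 3 epochs; keep the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

# x_train / y_train are placeholders for a real dataset:
# history = model.fit(x_train, y_train, validation_split=0.2,
#                     epochs=50, callbacks=[early_stop])
```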
Real-World Use Cases and Implications
Overfitting and underfitting are not just theoretical problems – they have tangible impacts in various domains:
- Financial Forecasting: In finance, models trained on historical market data can overfit by capturing anomalies or rare patterns that occurred in the past but aren’t predictive of the future. For example, a stock prediction model might be tuned to specific fluctuations (noise) in past stock prices; it will perform poorly when market conditions change. An overfit trading algorithm may show great back-test performance but incur large losses in live trading because it was too tailored to past idiosyncrasies. Conversely, an underfit model (perhaps a very basic linear model for a complex market) will miss key predictive signals and perform poorly on both historical and future data. Successful financial models use techniques to avoid overfitting, ensuring the model tracks broad trends and relationships rather than noise.
- Healthcare (Medical Diagnosis): In medical AI, overfitting can be dangerous. Suppose we train a diagnostic model on a small, homogeneous patient dataset: the model might memorize peculiarities of that group (say, imaging artifacts from one hospital’s machine or demographic-specific health indicators). Such an overfit model will fail to generalize to a broader patient population. For instance, a pneumonia detection CNN trained on chest X-rays from Hospital A may perform poorly on Hospital B’s X-rays if it overfit to irrelevant features present only in Hospital A’s images. Underfitting is also possible if the model is too simplistic (e.g. a linear classifier for a highly nonlinear biological process); it would miss complex symptom patterns, yielding high misdiagnosis rates. Ensuring diverse training data and using techniques like cross-validation are critical in healthcare to strike the right balance and achieve robust generalization.
- Natural Language Processing (NLP): NLP models can easily overfit or underfit depending on data and model size. Consider an email spam classifier: an overfit model might memorize specific spam emails in the training set (for instance, certain phrases or sender addresses), leading to high training accuracy. But spammers constantly change content, so the overfit classifier won’t catch new spam variants (poor generalization). In contrast, an underfit spam classifier might be so limited (e.g., looking only for a few keywords) that it misses many spam emails and wrongly flags legitimate ones. Domain adaptation is a related challenge: a sentiment model trained only on one type of text (say, movie reviews) can struggle on a different domain (like news articles), essentially overfitting to the training domain’s language patterns. This is why large language models are trained on very diverse data and use regularization techniques; yet even they show signs of memorizing rare training phrases if over-trained. The key in NLP is to use models that are complex enough to capture language nuances but to regularize and validate them on diverse data so they generalize across linguistic contexts.
Mitigation Strategies for Overfitting and Underfitting
Developing a good model involves applying strategies to combat overfitting and underfitting. Here are some proven techniques:
- Regularization (e.g. L1, L2, Dropout): Regularization adds a penalty term to the loss function to discourage overly complex models. For instance, L2 regularization (ridge) adds a penalty proportional to the square of the model weights, and L1 regularization (lasso) adds a penalty proportional to their absolute values. This keeps weights small, preventing the model from fitting noise. In neural networks, dropout is a regularization technique in which random neurons are dropped during training, forcing the network to learn redundant, robust features rather than relying on precise patterns (thus reducing overfitting). Overall, regularization introduces a little bias (slightly more error on training data) but significantly lowers variance, improving generalization.
- Early Stopping: This strategy monitors model performance on a validation set during training and halts training when the validation loss starts to rise (or when another chosen stopping criterion is met). By stopping at the point of minimal validation error, we prevent the model from over-training on the noise in the training data. Early stopping is simple yet very effective in practice: it finds the sweet spot before the model begins to overfit, acting as a guardrail that preserves the best generalization observed during training.
- Data Augmentation: If overfitting is due to limited data, an effective approach is to increase the training data size. Data augmentation generates additional training examples by transforming existing ones (for example, flipping or rotating images, adding noise to audio, paraphrasing text). This exposes the model to a wider variety of scenarios and makes it less likely to latch onto quirks of any single example. In computer vision and speech recognition, aggressive augmentation is often used to prevent overfitting. In essence, more (diverse) data helps the model learn the true underlying patterns rather than the noise.
- Cross-Validation: Cross-validation is a technique for more reliable model evaluation and hyperparameter tuning. In k-fold cross-validation, the data is split into k folds; the model is trained on k–1 folds and validated on the remaining fold, the process is repeated for each fold, and the performances are averaged. This not only provides a better estimate of model performance on unseen data, but also helps in model selection: we can try different model complexities or hyperparameters and pick the one with the best average validation performance across folds. By ensuring the model is tested on multiple subsets of data during development, cross-validation reduces the chance of overfitting to a particular train–test split. It can also be used to detect underfitting (if all folds show high error, the model is likely too simple). In practice, techniques like k-fold CV, along with regularization, guide us toward models that generalize well; the sketch after this list combines the two, using cross-validation to tune an L2 penalty.
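Here is that sketch: a deliberately flexible degree-12 polynomial model whose Ridge (L2) penalty strength alpha is chosen by 5-fold cross-validation. The synthetic sine data and the specific grid of alpha values are assumptions for illustration, not a prescription.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)

# A deliberately flexible model; the L2 penalty (alpha) shrinks its
# coefficients and controls how much it can bend to fit noise.
pipe = make_pipeline(PolynomialFeatures(degree=12, include_bias=False),
                     StandardScaler(),
                     Ridge())

# 5-fold cross-validation over a grid of regularization strengths.
search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": np.logspace(-6, 2, 9)},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print("best alpha:", search.best_params_["ridge__alpha"])
print("cross-validated MSE at best alpha:", -search.best_score_)
```

Very small alpha values leave the model close to unregularized and prone to overfitting the 60 noisy points; very large values over-shrink the coefficients toward an underfit; the cross-validated search picks a strength in between.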
Other best practices include using simpler models or architectures when appropriate (Occam’s Razor), ensembling multiple models (to cancel out individual overfitting tendencies), and ensuring quality, representative data. By monitoring metrics and applying these strategies, we navigate the bias–variance tradeoff towards an optimal solution. The end goal is a model that captures the signal in the data (low bias) without capturing the noise (low variance) – thereby avoiding underfitting and overfitting, and delivering reliable performance on real-world data.
References:
- Y. LeCun et al., “Deep Learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
- I. Goodfellow, Y. Bengio, and A. Courville, “Bias–Variance Tradeoff,” in Deep Learning, MIT Press, 2016, ch. 5, sec. 2, pp. 110–114.
- Scikit-learn Documentation, “Underfitting vs. Overfitting” (example), 2021.
- Wikipedia, “Overfitting,” Wikimedia Foundation, 2023.
- S. B. Akın, “Bias-Variance Trade-Off: Overfitting/Underfitting,” Medium, Sep. 2022.
- GeeksforGeeks, “Learning Curve to Identify Overfit & Underfit,” Feb. 2024.
- Lark AI Glossary, “Overfitting and Underfitting (Examples),” 2023.
- FutureBee AI, “Overfitting & Underfitting in NLP,” Mar. 2023.
- Google Developers, “Overfitting: Interpreting Loss Curves,” Machine Learning Crash Course, 2019.
- NumberAnalytics, “Overfitting Prevention Techniques,” Jul. 2023.