RemNote Community

Supervised learning - Training and Regularization Strategies

Understand empirical and structural risk minimization, regularization strategies with bias‑variance tradeoffs, and generative training methods.


Summary

Training Approaches Introduction

When training machine learning models, we face a fundamental challenge: how do we choose the parameters that produce the best model? Different training approaches answer this question in different ways. This section covers three key frameworks for model training: empirical risk minimization, structural risk minimization, and generative training. Understanding these approaches will help you see how different learning algorithms relate to each other and how to prevent models from overfitting.

Empirical Risk Minimization

Empirical risk minimization (ERM) is a straightforward principle: choose the model parameters that minimize the average loss on your training data:

$$\text{minimize} \quad \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)$$

where $L$ is your chosen loss function, $y_i$ are the true labels, and $\hat{y}_i$ are your model's predictions.

Connection to Maximum Likelihood Estimation

Here is an important insight: when your model outputs a conditional probability distribution $P(y | \mathbf{x})$ and you use negative log-likelihood as your loss function, empirical risk minimization becomes equivalent to maximum likelihood estimation (MLE). To see why, recall that the log-likelihood of your data is:

$$\log L(\theta) = \sum_{i=1}^{n} \log P(y_i | \mathbf{x}_i; \theta)$$

Minimizing the average negative log-likelihood,

$$\text{minimize} \quad -\frac{1}{n}\sum_{i=1}^{n} \log P(y_i | \mathbf{x}_i; \theta)$$

is exactly maximum likelihood estimation, just scaled by the constant $1/n$. This connection is important because it means many supervised learning algorithms can be understood as finding the parameters that make the observed data most likely under your model.

The Overfitting Problem

While ERM is theoretically appealing, it has a critical weakness: it can lead to overfitting. Overfitting occurs when your model learns the training data too well, capturing noise rather than true patterns, which causes poor performance on new test data.
This happens in two scenarios:

When your hypothesis space is large: with many possible model parameters to choose from, it becomes easier to find parameters that fit the training data perfectly simply by chance.

When training data are insufficient: fewer examples make it harder to distinguish between real patterns and random noise in the data.

The solution to overfitting is regularization, which we address next.

Structural Risk Minimization

Structural risk minimization (SRM) improves on ERM by adding a penalty term that discourages overfitting. Instead of minimizing just the training loss, we minimize:

$$\text{minimize} \quad \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i) + \lambda R(\mathbf{w})$$

where $R(\mathbf{w})$ is a regularization penalty and $\lambda$ is a hyperparameter that controls how strongly we penalize complexity.

Types of Regularization Penalties

For linear models, the most common penalty is L2 regularization (used in ridge regression):

$$R(\mathbf{w}) = \|\mathbf{w}\|_2^2 = \sum_{j} w_j^2$$

This penalty discourages the model from having large weights, which reduces overfitting. It is popular because it is smooth and computationally convenient. Two other important penalties exist:

L1 regularization (used in the LASSO): $R(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_{j} |w_j|$, the sum of the absolute values of the weights. L1 has an interesting property: it tends to force some weights exactly to zero, effectively performing automatic feature selection.

Zero-norm (L0): $R(\mathbf{w}) = \|\mathbf{w}\|_0 = \text{count of non-zero weights}$. This directly counts how many weights are non-zero, but it is computationally difficult to optimize.

The Bias-Variance Tradeoff

The regularization parameter $\lambda$ controls a fundamental tradeoff in machine learning. When you increase $\lambda$:

Bias increases: the model is more constrained, so it may underfit the data and produce systematically biased predictions.
Variance decreases: the model is more stable across different training datasets, reducing sensitivity to the specific sample.

When you decrease $\lambda$:

Bias decreases: the model fits the training data more closely.

Variance increases: the model becomes more sensitive to noise in the training data.

Your goal is to find the sweet spot where bias and variance are balanced.

Selecting the Regularization Parameter

You cannot simply train your model and pick the $\lambda$ that performs best on your training data; that would reintroduce overfitting. Instead, use cross-validation:

1. Split your data into multiple folds (typically 5 or 10).
2. For each candidate value of $\lambda$: train the model on all but one fold, evaluate on the held-out fold, and average the performance across folds.
3. Select the $\lambda$ with the best cross-validation performance.

This approach gives an honest estimate of how each $\lambda$ value will perform on unseen data.

Generative Training

So far we have discussed discriminative approaches that directly model $P(y|\mathbf{x})$. Generative training takes a different perspective: model the joint probability distribution $P(\mathbf{x}, y)$ instead. To make predictions, you still compute $P(y|\mathbf{x})$ using Bayes' rule:

$$P(y|\mathbf{x}) = \frac{P(\mathbf{x}, y)}{P(\mathbf{x})} = \frac{P(\mathbf{x}|y)P(y)}{P(\mathbf{x})}$$

Generative models use the negative log-likelihood of the joint distribution as their loss:

$$\text{minimize} \quad -\sum_{i=1}^{n} \log P(\mathbf{x}_i, y_i)$$

Advantages of Generative Training

Generative models offer important computational and practical advantages:

Closed-form solutions: many generative models admit analytical solutions that can be computed directly without iterative optimization. For example, naive Bayes and linear discriminant analysis (LDA) both have closed-form parameter estimates.
Computational efficiency: because you can often solve for the optimal parameters analytically, training is fast compared to iterative optimization methods.

Theoretical interpretability: the explicit probabilistic model makes it easier to understand what your model is learning.

However, generative models make stronger assumptions about your data (the form of $P(\mathbf{x}|y)$ and $P(y)$), which can hurt performance when those assumptions are violated.
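The pieces above can be put together in a small sketch: ridge regression (ERM with an L2 penalty) has a closed-form solution, and the penalty strength $\lambda$ is chosen by k-fold cross-validation rather than training error. This is a minimal NumPy illustration under assumed synthetic data; the helper names `fit_ridge` and `cross_val_mse` are ours, not from the text.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cross_val_mse(X, y, lam, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_ridge(X[train], y[train], lam)
        errors.append(np.mean((X[test] @ w - y[test]) ** 2))
    return float(np.mean(errors))

# Synthetic data (illustrative): y depends on 3 of 20 features, plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.3 * rng.normal(size=60)

# Select lambda by cross-validation, not by training-set performance.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: cross_val_mse(X, y, lam))
print("selected lambda:", best_lam)
```

Note how each fold's model is evaluated only on data it never saw, which is what makes the resulting $\lambda$ choice an honest one.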
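The closed-form advantage of generative training can also be sketched with Gaussian naive Bayes: the maximum-likelihood parameters of $P(\mathbf{x}, y)$ (class priors, per-class feature means and variances) are simple averages, so no iterative optimization is needed. The function names and the synthetic two-class data below are illustrative assumptions, not part of the original text.

```python
import numpy as np

def fit_gnb(X, y):
    """Closed-form MLE: priors P(y), plus per-class means/variances for P(x|y)."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, vars_

def predict_gnb(model, X):
    """Predict argmax_y log P(x, y) = log P(y) + sum_j log N(x_j; mu_j, var_j)."""
    classes, priors, means, vars_ = model
    log_joint = np.stack([
        np.log(priors[k])
        - 0.5 * np.sum(np.log(2 * np.pi * vars_[k])
                       + (X - means[k]) ** 2 / vars_[k], axis=1)
        for k in range(len(classes))
    ], axis=1)
    return classes[np.argmax(log_joint, axis=1)]

# Two well-separated Gaussian classes (illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)),
               rng.normal(loc=+2.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = fit_gnb(X, y)       # one pass over the data, no gradient steps
acc = np.mean(predict_gnb(model, X) == y)
print("training accuracy:", acc)
```

Prediction uses exactly the Bayes'-rule decomposition above: since $P(\mathbf{x})$ is the same for every class, comparing joint probabilities $P(\mathbf{x}, y) = P(\mathbf{x}|y)P(y)$ suffices. The naive-Bayes independence assumption on the features is the kind of strong assumption that can hurt performance when violated.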
Flashcards
What is empirical risk minimization equivalent to when the model outputs a conditional probability distribution and the loss is negative log-likelihood?
Maximum likelihood estimation
Under what two conditions can empirical risk minimization lead to overfitting?
Large hypothesis space or insufficient training data
What is a common regularization penalty used for linear models in structural risk minimization?
The squared Euclidean norm of the weight vector ($\|\mathbf{w}\|_2^2$)
Besides the squared Euclidean norm, what are two other common penalties used for regularization?
The $\|\mathbf{w}\|_1$ norm (sum of absolute values) and the "zero-norm" $\|\mathbf{w}\|_0$ (count of non-zero weights)
How does a larger regularization parameter $\lambda$ affect the bias-variance tradeoff?
It increases bias and reduces variance
What technique is used to select an appropriate value for the regularization parameter $\lambda$?
Cross-validation
How does generative training treat the model in terms of probability distributions?
As a joint probability distribution
What loss function is used during generative training?
Negative log-likelihood

Key Concepts
Model Evaluation and Selection
Empirical Risk Minimization
Maximum Likelihood Estimation
Overfitting
Structural Risk Minimization
Bias–Variance Tradeoff
Cross‑Validation
Regularization Techniques
L2 Regularization (Ridge)
L1 Regularization (Lasso)
Generative Models
Generative Training
Naive Bayes Classifier
Linear Discriminant Analysis