RemNote Community

Supervised learning - Training and Regularization Strategies

Understand empirical and structural risk minimization, regularization strategies with bias‑variance tradeoffs, and generative training methods.


Summary

Training Approaches Introduction

When training machine learning models, we face a fundamental challenge: how do we choose the parameters that produce the best model? Different training approaches answer this question in different ways. This section covers three key frameworks for model training: empirical risk minimization, structural risk minimization, and generative training. Understanding these approaches will help you see how different learning algorithms relate to each other and how to prevent models from overfitting.

Empirical Risk Minimization

Empirical risk minimization (ERM) is a straightforward principle: choose the model parameters that minimize the average loss on your training data:

$$\text{minimize} \quad \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)$$

where $L$ is your chosen loss function, $y_i$ are the true labels, and $\hat{y}_i$ are your model's predictions.

Connection to Maximum Likelihood Estimation

Here is an important insight: when your model outputs a conditional probability distribution $P(y | \mathbf{x})$ and you use negative log-likelihood as your loss function, empirical risk minimization becomes equivalent to maximum likelihood estimation (MLE). To see why, recall that the log-likelihood of your data is:

$$\log L(\theta) = \sum_{i=1}^{n} \log P(y_i | \mathbf{x}_i; \theta)$$

Minimizing the average negative log-likelihood,

$$\text{minimize} \quad -\frac{1}{n}\sum_{i=1}^{n} \log P(y_i | \mathbf{x}_i; \theta)$$

is exactly maximum likelihood estimation, just scaled by the constant $1/n$. This connection is important because it means many supervised learning algorithms can be understood as finding the parameters that make the observed data most likely under your model.

The Overfitting Problem

While ERM is theoretically appealing, it has a critical weakness: it can lead to overfitting. Overfitting occurs when your model learns the training data too well, capturing noise rather than true patterns, which causes poor performance on new test data.
This happens in two scenarios:

When your hypothesis space is large: with many possible model parameters to choose from, it becomes easier to find parameters that fit the training data perfectly simply by chance.

When training data are insufficient: fewer examples make it harder to distinguish between real patterns and random noise in the data.

The solution to overfitting is regularization, which we address next.

Structural Risk Minimization

Structural risk minimization (SRM) improves on ERM by adding a penalty term that discourages overfitting. Instead of minimizing just the training loss, we minimize:

$$\text{minimize} \quad \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i) + \lambda R(\mathbf{w})$$

where $R(\mathbf{w})$ is a regularization penalty and $\lambda$ is a hyperparameter that controls how strongly we penalize complexity.

Types of Regularization Penalties

For linear models, the most common penalty is L2 regularization (used in ridge regression):

$$R(\mathbf{w}) = \|\mathbf{w}\|_2^2 = \sum_{j} w_j^2$$

This penalty discourages the model from having large weights, which reduces overfitting. It is popular because it is smooth and computationally convenient. Two other important penalties exist:

L1 regularization (used in the LASSO): $R(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_{j} |w_j|$, the sum of the absolute values of the weights. L1 has an interesting property: it tends to force some weights exactly to zero, effectively performing automatic feature selection.

Zero-norm (L0): $R(\mathbf{w}) = \|\mathbf{w}\|_0 = \text{count of non-zero weights}$. This directly counts how many weights are non-zero, but it is computationally difficult to optimize.

The Bias-Variance Tradeoff

The regularization parameter $\lambda$ controls a fundamental tradeoff in machine learning. When you increase $\lambda$:

Bias increases: the model is more constrained, so it may underfit the data and produce systematically biased predictions.
Variance decreases: the model is more stable across different training datasets, reducing sensitivity to the specific sample.

When you decrease $\lambda$:

Bias decreases: the model fits the training data more closely.

Variance increases: the model becomes more sensitive to noise in the training data.

Your goal is to find the sweet spot where bias and variance are balanced.

Selecting the Regularization Parameter

You cannot simply train your model and pick the $\lambda$ that performs best on your training data; that would reintroduce overfitting. Instead, use cross-validation:

1. Split your data into multiple folds (typically 5 or 10).
2. For each candidate value of $\lambda$: train the model on all but one fold, evaluate on the held-out fold, and average the performance across folds.
3. Select the $\lambda$ with the best cross-validation performance.

This approach gives an honest estimate of how each $\lambda$ value will perform on unseen data.

Generative Training

So far we have discussed discriminative approaches that directly model $P(y|\mathbf{x})$. Generative training takes a different perspective: model the joint probability distribution $P(\mathbf{x}, y)$ instead. To make predictions, you still compute $P(y|\mathbf{x})$ using Bayes' rule:

$$P(y|\mathbf{x}) = \frac{P(\mathbf{x}, y)}{P(\mathbf{x})} = \frac{P(\mathbf{x}|y)P(y)}{P(\mathbf{x})}$$

Generative models use the negative log-likelihood of the joint distribution as their loss:

$$\text{minimize} \quad -\sum_{i=1}^{n} \log P(\mathbf{x}_i, y_i)$$

Advantages of Generative Training

Generative models offer important computational and practical advantages:

Closed-form solutions: many generative models admit analytical solutions that can be computed directly without iterative optimization. For example, naive Bayes and linear discriminant analysis (LDA) both have closed-form parameter estimates.
Computational efficiency: because you can often solve for the optimal parameters analytically, training is fast compared to iterative optimization methods.

Theoretical interpretability: the explicit probabilistic model makes it easier to understand what your model is learning.

However, generative models make stronger assumptions about your data (the form of $P(\mathbf{x}|y)$ and $P(y)$), which can hurt performance when those assumptions are violated.
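The pieces above can be put together in a small sketch: ridge regression (ERM with an L2 penalty) has a closed-form solution, and the penalty strength $\lambda$ is chosen by k-fold cross-validation rather than training error. This is a minimal NumPy illustration under assumed synthetic data; the helper names `fit_ridge` and `cross_val_mse` are ours, not from the text.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cross_val_mse(X, y, lam, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_ridge(X[train], y[train], lam)
        errors.append(np.mean((X[test] @ w - y[test]) ** 2))
    return float(np.mean(errors))

# Synthetic data (illustrative): y depends on 3 of 20 features, plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.3 * rng.normal(size=60)

# Select lambda by cross-validation, not by training-set performance.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: cross_val_mse(X, y, lam))
print("selected lambda:", best_lam)
```

Note how each fold's model is evaluated only on data it never saw, which is what makes the resulting $\lambda$ choice an honest one.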
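The closed-form advantage of generative training can also be sketched with Gaussian naive Bayes: the maximum-likelihood parameters of $P(\mathbf{x}, y)$ (class priors, per-class feature means and variances) are simple averages, so no iterative optimization is needed. The function names and the synthetic two-class data below are illustrative assumptions, not part of the original text.

```python
import numpy as np

def fit_gnb(X, y):
    """Closed-form MLE: priors P(y), plus per-class means/variances for P(x|y)."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, vars_

def predict_gnb(model, X):
    """Predict argmax_y log P(x, y) = log P(y) + sum_j log N(x_j; mu_j, var_j)."""
    classes, priors, means, vars_ = model
    log_joint = np.stack([
        np.log(priors[k])
        - 0.5 * np.sum(np.log(2 * np.pi * vars_[k])
                       + (X - means[k]) ** 2 / vars_[k], axis=1)
        for k in range(len(classes))
    ], axis=1)
    return classes[np.argmax(log_joint, axis=1)]

# Two well-separated Gaussian classes (illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)),
               rng.normal(loc=+2.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = fit_gnb(X, y)       # one pass over the data, no gradient steps
acc = np.mean(predict_gnb(model, X) == y)
print("training accuracy:", acc)
```

Prediction uses exactly the Bayes'-rule decomposition above: since $P(\mathbf{x})$ is the same for every class, comparing joint probabilities $P(\mathbf{x}, y) = P(\mathbf{x}|y)P(y)$ suffices. The naive-Bayes independence assumption on the features is the kind of strong assumption that can hurt performance when violated.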
Flashcards
What is empirical risk minimization equivalent to when the model outputs a conditional probability distribution and the loss is negative log-likelihood?
Maximum likelihood estimation
Under what two conditions can empirical risk minimization lead to overfitting?
Large hypothesis space or insufficient training data
What is a common regularization penalty used for linear models in structural risk minimization?
The squared Euclidean norm of the weight vector ($\|\mathbf{w}\|_2^2$)
Besides the squared Euclidean norm, what are two other common penalties used for regularization?
The $\|\mathbf{w}\|_1$ norm (sum of absolute values) and the "zero-norm" $\|\mathbf{w}\|_0$ (count of non-zero weights)
How does a larger regularization parameter $\lambda$ affect the bias-variance tradeoff?
It increases bias and reduces variance
What technique is used to select an appropriate value for the regularization parameter $\lambda$?
Cross-validation
How does generative training treat the model in terms of probability distributions?
As a joint probability distribution
What loss function is used during generative training?
Negative log-likelihood

Key Concepts
Model Evaluation and Selection
Empirical Risk Minimization
Maximum Likelihood Estimation
Overfitting
Structural Risk Minimization
Bias–Variance Tradeoff
Cross‑Validation
Regularization Techniques
L2 Regularization (Ridge)
L1 Regularization (Lasso)
Generative Models
Generative Training
Naive Bayes Classifier
Linear Discriminant Analysis