Supervised Learning Study Guide
📖 Core Concepts
Supervised learning – algorithm learns a mapping \(f:\mathbf{x}\rightarrow y\) from labeled examples \((\mathbf{x}_i, y_i)\).
Generalization error – how well the trained model predicts unseen data.
Bias–variance trade‑off – bias = systematic error (under‑fitting); variance = sensitivity to training‑set fluctuations (over‑fitting).
Inductive bias – the assumptions (e.g., linearity, smoothness) an algorithm makes to generalize beyond the data it has seen.
Empirical Risk Minimization (ERM) – choose \(f\) that minimizes average loss
$$\hat{R}(f)=\frac{1}{n}\sum_{i=1}^{n}L\bigl(f(\mathbf{x}_i),y_i\bigr).$$
Structural Risk Minimization (SRM) – ERM + regularization penalty \(\Omega(f)\) to control model complexity:
$$\min_{f} \; \hat{R}(f)+\lambda\Omega(f).$$
No Free Lunch – no single algorithm dominates across all problems; algorithm choice must match data characteristics and resources.
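The ERM-vs-SRM distinction is easiest to see in ridge regression, where both objectives have closed forms. A minimal numpy sketch (toy data and \(\lambda\) value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2*x + noise.
n = 50
X = rng.normal(size=(n, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, lam):
    """Minimize (1/n) * sum (Xw - y)^2 + lam * ||w||^2 (closed form)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

w_erm = ridge_fit(X, y, lam=0.0)   # pure ERM (ordinary least squares)
w_srm = ridge_fit(X, y, lam=1.0)   # ERM + L2 penalty (SRM)
# The penalty shrinks the fitted weight toward zero: |w_srm| < |w_erm|.
```

Setting \(\lambda=0\) recovers pure ERM; any \(\lambda>0\) trades some training fit for a simpler (smaller-norm) hypothesis.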
📌 Must Remember
Training vs. test – always evaluate on data not used for training (hold‑out set or cross‑validation).
Regularization:
\(\ell_2\) norm \(\| \mathbf{w} \|_2^2\) → shrinks weights, reduces variance.
\(\ell_1\) norm \(\| \mathbf{w} \|_1\) → promotes sparsity (feature selection).
Larger \(\lambda\) → higher bias, lower variance.
Bias‑high / variance‑low → simple model, few parameters, works when data are scarce or noisy.
Bias‑low / variance‑high → flexible model (e.g., deep nets), needs lots of data and careful regularization.
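The "larger \(\lambda\) → higher bias, lower variance" claim can be checked empirically: refit a 1-D ridge model on many independent training sets and compare the spread of the fitted weight. A numpy sketch (the true weight 2.0, \(\lambda\) values, and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge_weight(X, y, lam):
    # Closed-form 1-D ridge: w = sum(x*y) / (sum(x^2) + n*lam)
    return (X * y).sum() / ((X ** 2).sum() + len(X) * lam)

def weight_stats(lam, trials=200, n=20):
    """Mean and std. dev. of the fitted weight across training sets."""
    ws = []
    for _ in range(trials):
        X = rng.normal(size=n)
        y = 2.0 * X + rng.normal(scale=1.0, size=n)   # true weight = 2.0
        ws.append(ridge_weight(X, y, lam))
    return np.mean(ws), np.std(ws)

mean_0, spread_0 = weight_stats(lam=0.0)
mean_big, spread_big = weight_stats(lam=5.0)
# Larger lambda: estimates vary less across training sets (lower variance)
# but are pulled further from the true weight 2.0 (higher bias).
```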
Key loss functions:
Squared error for regression.
Log‑loss (negative log‑likelihood) for probabilistic classifiers (logistic regression, naive Bayes).
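Both losses are one-liners in numpy; the toy values below are illustrative:

```python
import numpy as np

# Squared error (regression): penalizes large residuals quadratically.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
mse = np.mean((y_true - y_pred) ** 2)

# Log-loss (binary classification): negative log-likelihood of the
# predicted probabilities; confident wrong answers are penalized hard.
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.6])          # predicted P(y = 1 | x)
log_loss = -np.mean(labels * np.log(probs)
                    + (1 - labels) * np.log(1 - probs))
```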
Common algorithm strengths:
Linear models → fast, good for high‑dimensional sparse data.
Decision trees / ensembles → handle heterogeneous, categorical features, capture interactions.
k‑NN → simple, non‑parametric, sensitive to irrelevant features / scaling.
SVM → effective in high‑dimensional spaces with kernels.
Neural nets → learn complex non‑linear mappings given large data.
🔄 Key Processes
Problem Formulation – define feature vector \(\mathbf{x}\) and label \(y\).
Algorithm Selection – match problem (size, dimensionality, noise, feature type) to algorithm bias‑variance profile.
Data Pre‑processing –
Scale/normalize numeric features (important for distance‑based methods).
Encode categorical variables.
Optional: feature selection or dimensionality reduction (PCA, LDA).
Model Training – fit parameters by minimizing ERM (or MLE for probabilistic models).
Regularization & Hyper‑parameter Tuning – use cross‑validation to pick \(\lambda\), \(k\) (k‑NN), kernel parameters, depth, etc.
Model Evaluation – compute validation/test error, confusion matrix, ROC‑AUC, etc.
Calibration (if needed) – adjust predicted class probabilities (e.g., Platt scaling, isotonic regression).
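The steps above map almost one-to-one onto a scikit-learn pipeline. A minimal end-to-end sketch (the synthetic dataset and the hyper-parameter grid are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Problem formulation: feature matrix X, binary label y.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 2-3. Hold out a test set; scaling lives inside the pipeline so it is
# re-fit on each training fold (no leakage into validation folds).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# 4-5. Cross-validate over the regularization strength C
# (C is the inverse of lambda in the SRM objective).
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_tr, y_tr)

# 6. Evaluate once on the held-out data.
test_acc = search.score(X_te, y_te)
```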
🔍 Key Comparisons
Linear Regression vs. Logistic Regression
Linear: predicts continuous value; loss = squared error.
Logistic: predicts probability of binary class; loss = log‑loss, applies sigmoid.
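The two models share the same affine score \(\mathbf{w}^\top\mathbf{x}+b\); only the output stage differs. A sketch with made-up weights:

```python
import numpy as np

w, b = np.array([1.5, -0.5]), 0.2   # illustrative parameters
x = np.array([2.0, 1.0])

# Linear regression: the raw affine score IS the prediction.
y_linear = w @ x + b                # continuous value

# Logistic regression: squash the same score through a sigmoid
# to get P(y = 1 | x), then threshold at 0.5 to classify.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p = sigmoid(w @ x + b)              # probability in (0, 1)
y_class = int(p >= 0.5)
```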
Decision Tree vs. k‑NN
Tree: learns explicit decision rules, handles mixed data, insensitive to feature scaling.
k‑NN: instance‑based, requires distance metric, highly sensitive to irrelevant features and scaling.
SVM (hard‑margin) vs. Soft‑margin (with \(\lambda\))
Hard‑margin: assumes data are perfectly separable, can overfit noisy data.
Soft‑margin: adds slack variables/regularization to tolerate misclassifications.
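The slack variables correspond to the hinge loss: zero for points safely beyond the margin, a linear penalty otherwise. A numpy sketch with hand-picked scores:

```python
import numpy as np

def hinge_loss(scores, labels):
    """Soft-margin SVM loss: max(0, 1 - y * f(x)) per example."""
    return np.maximum(0.0, 1.0 - labels * scores)

labels = np.array([+1, +1, -1])
scores = np.array([2.0,    # correct, outside the margin  -> loss 0.0
                   0.5,    # correct but inside the margin -> loss 0.5
                   1.0])   # misclassified                 -> loss 2.0
losses = hinge_loss(scores, labels)
```

The hard-margin case corresponds to demanding every loss be exactly zero, which is impossible when the data are not separable.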
Generative (Naive Bayes, LDA) vs. Discriminative (Logistic, SVM)
Generative: model joint \(p(\mathbf{x},y)\), often have closed‑form solutions, need strong independence assumptions.
Discriminative: model conditional \(p(y|\mathbf{x})\) or decision boundary directly, usually higher predictive accuracy.
⚠️ Common Misunderstandings
“More features always improve performance.” → High dimensionality can inflate variance; irrelevant/redundant features hurt linear & distance‑based methods.
“Regularization always harms accuracy.” – It reduces variance, often improving test performance, especially with limited data.
“k‑NN is non‑parametric, so no overfitting.” – With small \(k\) and noisy data, k‑NN can overfit; larger \(k\) increases bias.
“A higher training accuracy means a better model.” – Might indicate overfitting; always compare to validation/test error.
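The k-NN misunderstanding above is easy to demonstrate: with \(k=1\) and continuous features, every training point is its own nearest neighbor, so training accuracy is perfect even on deliberately noisy labels. A scikit-learn sketch (dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# flip_y injects label noise to make memorization harmful.
X, y = make_classification(n_samples=200, n_features=5, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# k = 1 memorizes the (noisy) training set: perfect training accuracy.
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
train_acc = knn1.score(X_tr, y_tr)     # 1.0 by construction

# A larger k averages over neighbors: higher bias, and typically
# better generalization when labels are noisy.
knn15 = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)
test_acc_1, test_acc_15 = knn1.score(X_te, y_te), knn15.score(X_te, y_te)
```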
🧠 Mental Models / Intuition
Bias–Variance Slider – Imagine a slider: left = high bias/low variance (simple line), right = low bias/high variance (wiggly curve). Your job is to stop near the sweet spot where the curve fits the underlying trend but not the noise.
Regularization as “tension” – Adding \(\lambda\) pulls model weights toward zero, smoothing predictions like a rubber band stretched over data points.
Hypothesis Space \(\mathcal{H}\) – Think of \(\mathcal{H}\) as the “toolbox” an algorithm can reach into; a bigger toolbox gives flexibility (low bias) but also more ways to pick the wrong tool (high variance).
🚩 Exceptions & Edge Cases
Noisy labels – Prefer higher‑bias models (e.g., linear SVM with strong regularization) and consider early stopping.
Very small datasets – Linear models with strong regularization or Naive Bayes often outperform complex models.
Highly imbalanced classes – Accuracy is misleading; use precision/recall, ROC‑AUC, or apply class weighting / resampling.
Heterogeneous feature types – Decision trees handle mixed numeric/categorical data natively; other algorithms need careful preprocessing.
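The imbalanced-classes trap in a few lines of numpy: a classifier that always predicts the majority class scores well on accuracy yet never finds a positive. The 95/5 split is illustrative:

```python
import numpy as np

# 95% negatives, 5% positives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)   # always predict the majority class

accuracy = np.mean(y_pred == y_true)        # 0.95 -- looks great
tp = np.sum((y_pred == 1) & (y_true == 1))
recall = tp / np.sum(y_true == 1)           # 0.0  -- the real story
```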
📍 When to Use Which
| Situation | Recommended Algorithm(s) | Reason |
|-----------|---------------------------|--------|
| Few samples, many features | Linear/Logistic regression with \(\ell_1\) regularization, Naive Bayes | High bias tolerates scarcity; regularization combats overfitting. |
| Complex non‑linear relationships, ample data | Neural networks, ensemble trees (random forest, boosting) | Low bias, high capacity; regularization via early stopping/boosting reduces variance. |
| Clear margin between classes, possibly high‑dimensional | SVM with appropriate kernel | Maximizes margin, works well with kernels for non‑linear separations. |
| Need interpretable rules | Decision tree, logistic regression | Provides explicit decision paths or coefficient interpretation. |
| Similarity‑based prediction (e.g., recommendation) | k‑NN | Non‑parametric, simple to implement; ensure proper scaling. |
| Probabilistic output required & independence approx. holds | Naive Bayes, LDA | Efficient, closed‑form solutions; good baseline. |
👀 Patterns to Recognize
“High variance → many parameters, flexible model, large training set needed.”
“Feature redundancy → linear or distance‑based models become unstable; look for regularization or PCA.”
“Noise in labels → training loss keeps decreasing while validation loss plateaus or rises (classic overfit curve).”
“Decision boundary looks linear in transformed space → consider kernel SVM.”
“Model predictions are over‑confident (probabilities near 0 or 1) → check calibration.”
🗂️ Exam Traps
Equating low bias with a good model – a model can have low bias but still overfit (and generalize poorly) if variance is high.
Choosing k‑NN with unscaled features – distance dominated by large‑scale variables → wrong neighbor selection.
Assuming “more layers = better NN” – without enough data or regularization, deeper nets just increase variance.
Mixing up loss functions – using squared error for classification leads to poor probability estimates; log‑loss is required for probabilistic classifiers.
Ignoring the regularization parameter – setting \(\lambda=0\) in SRM reduces to pure ERM → high risk of overfitting.
Selecting algorithm solely on training accuracy – exam questions often present a training‑set‑perfect model; the correct answer will emphasize validation performance or bias‑variance reasoning.
---
Use this guide for a quick “last‑minute” review – focus on the core concepts, the bias‑variance intuition, and the decision‑rules for algorithm choice. Good luck!