Supervised Learning Study Guide
📖 Core Concepts
Supervised learning – algorithm learns a mapping \(f:\mathbf{x}\rightarrow y\) from labeled examples \((\mathbf{x}_i, y_i)\).
Generalization error – how well the trained model predicts unseen data.
Bias–variance trade‑off – bias = systematic error (under‑fitting); variance = sensitivity to training‑set fluctuations (over‑fitting).
Inductive bias – the assumptions (e.g., linearity, smoothness) an algorithm makes to generalize beyond the data it has seen.
Empirical Risk Minimization (ERM) – choose \(f\) that minimizes average loss
$$\hat{R}(f)=\frac{1}{n}\sum_{i=1}^{n}L\bigl(f(\mathbf{x}_i),y_i\bigr).$$
Structural Risk Minimization (SRM) – ERM + regularization penalty \(\Omega(f)\) to control model complexity:
$$\min_{f} \; \hat{R}(f)+\lambda\Omega(f).$$
No Free Lunch – no single algorithm dominates across all problems; algorithm choice must match data characteristics and resources.
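The ERM-vs-SRM distinction is easiest to see in ridge regression, where both objectives have closed forms. A minimal numpy sketch (toy data and \(\lambda\) value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2*x + noise.
n = 50
X = rng.normal(size=(n, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, lam):
    """Minimize (1/n) * sum (Xw - y)^2 + lam * ||w||^2 (closed form)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

w_erm = ridge_fit(X, y, lam=0.0)   # pure ERM (ordinary least squares)
w_srm = ridge_fit(X, y, lam=1.0)   # ERM + L2 penalty (SRM)
# The penalty shrinks the fitted weight toward zero: |w_srm| < |w_erm|.
```

Setting \(\lambda=0\) recovers pure ERM; any \(\lambda>0\) trades some training fit for a simpler (smaller-norm) hypothesis.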
📌 Must Remember
Training vs. test – always evaluate on data not used for training (hold‑out set or cross‑validation).
Regularization:
\(\ell_2\) norm \(\| \mathbf{w} \|_2^2\) → shrinks weights, reduces variance.
\(\ell_1\) norm \(\| \mathbf{w} \|_1\) → promotes sparsity (feature selection).
Larger \(\lambda\) → higher bias, lower variance.
Bias‑high / variance‑low → simple model, few parameters, works when data are scarce or noisy.
Bias‑low / variance‑high → flexible model (e.g., deep nets), needs lots of data and careful regularization.
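The "larger \(\lambda\) → higher bias, lower variance" claim can be checked empirically: refit a 1-D ridge model on many independent training sets and compare the spread of the fitted weight. A numpy sketch (the true weight 2.0, \(\lambda\) values, and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge_weight(X, y, lam):
    # Closed-form 1-D ridge: w = sum(x*y) / (sum(x^2) + n*lam)
    return (X * y).sum() / ((X ** 2).sum() + len(X) * lam)

def weight_stats(lam, trials=200, n=20):
    """Mean and std. dev. of the fitted weight across training sets."""
    ws = []
    for _ in range(trials):
        X = rng.normal(size=n)
        y = 2.0 * X + rng.normal(scale=1.0, size=n)   # true weight = 2.0
        ws.append(ridge_weight(X, y, lam))
    return np.mean(ws), np.std(ws)

mean_0, spread_0 = weight_stats(lam=0.0)
mean_big, spread_big = weight_stats(lam=5.0)
# Larger lambda: estimates vary less across training sets (lower variance)
# but are pulled further from the true weight 2.0 (higher bias).
```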
Key loss functions:
Squared error for regression.
Log‑loss (negative log‑likelihood) for probabilistic classifiers (logistic regression, naive Bayes).
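Both losses are one-liners in numpy; the toy values below are illustrative:

```python
import numpy as np

# Squared error (regression): penalizes large residuals quadratically.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
mse = np.mean((y_true - y_pred) ** 2)

# Log-loss (binary classification): negative log-likelihood of the
# predicted probabilities; confident wrong answers are penalized hard.
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.6])          # predicted P(y = 1 | x)
log_loss = -np.mean(labels * np.log(probs)
                    + (1 - labels) * np.log(1 - probs))
```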
Common algorithm strengths:
Linear models → fast, good for high‑dimensional sparse data.
Decision trees / ensembles → handle heterogeneous, categorical features, capture interactions.
k‑NN → simple, non‑parametric, sensitive to irrelevant features / scaling.
SVM → effective in high‑dimensional spaces with kernels.
Neural nets → learn complex non‑linear mappings given large data.
🔄 Key Processes
Problem Formulation – define feature vector \(\mathbf{x}\) and label \(y\).
Algorithm Selection – match problem (size, dimensionality, noise, feature type) to algorithm bias‑variance profile.
Data Pre‑processing –
Scale/normalize numeric features (important for distance‑based methods).
Encode categorical variables.
Optional: feature selection or dimensionality reduction (PCA, LDA).
Model Training – fit parameters by minimizing ERM (or MLE for probabilistic models).
Regularization & Hyper‑parameter Tuning – use cross‑validation to pick \(\lambda\), \(k\) (k‑NN), kernel parameters, depth, etc.
Model Evaluation – compute validation/test error, confusion matrix, ROC‑AUC, etc.
Calibration (if needed) – adjust predicted class probabilities (e.g., Platt scaling, isotonic regression).
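The steps above map almost one-to-one onto a scikit-learn pipeline. A minimal end-to-end sketch (the synthetic dataset and the hyper-parameter grid are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Problem formulation: feature matrix X, binary label y.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 2-3. Hold out a test set; scaling lives inside the pipeline so it is
# re-fit on each training fold (no leakage into validation folds).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# 4-5. Cross-validate over the regularization strength C
# (C is the inverse of lambda in the SRM objective).
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_tr, y_tr)

# 6. Evaluate once on the held-out data.
test_acc = search.score(X_te, y_te)
```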
🔍 Key Comparisons
Linear Regression vs. Logistic Regression
Linear: predicts continuous value; loss = squared error.
Logistic: predicts probability of binary class; loss = log‑loss, applies sigmoid.
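The two models share the same affine score \(\mathbf{w}^\top\mathbf{x}+b\); only the output stage differs. A sketch with made-up weights:

```python
import numpy as np

w, b = np.array([1.5, -0.5]), 0.2   # illustrative parameters
x = np.array([2.0, 1.0])

# Linear regression: the raw affine score IS the prediction.
y_linear = w @ x + b                # continuous value

# Logistic regression: squash the same score through a sigmoid
# to get P(y = 1 | x), then threshold at 0.5 to classify.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p = sigmoid(w @ x + b)              # probability in (0, 1)
y_class = int(p >= 0.5)
```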
Decision Tree vs. k‑NN
Tree: learns explicit decision rules, handles mixed data, insensitive to feature scaling.
k‑NN: instance‑based, requires distance metric, highly sensitive to irrelevant features and scaling.
SVM (hard‑margin) vs. Soft‑margin (with \(\lambda\))
Hard‑margin: assumes data are perfectly separable, can overfit noisy data.
Soft‑margin: adds slack variables/regularization to tolerate misclassifications.
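The slack variables correspond to the hinge loss: zero for points safely beyond the margin, a linear penalty otherwise. A numpy sketch with hand-picked scores:

```python
import numpy as np

def hinge_loss(scores, labels):
    """Soft-margin SVM loss: max(0, 1 - y * f(x)) per example."""
    return np.maximum(0.0, 1.0 - labels * scores)

labels = np.array([+1, +1, -1])
scores = np.array([2.0,    # correct, outside the margin  -> loss 0.0
                   0.5,    # correct but inside the margin -> loss 0.5
                   1.0])   # misclassified                 -> loss 2.0
losses = hinge_loss(scores, labels)
```

The hard-margin case corresponds to demanding every loss be exactly zero, which is impossible when the data are not separable.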
Generative (Naive Bayes, LDA) vs. Discriminative (Logistic, SVM)
Generative: model joint \(p(\mathbf{x},y)\), often have closed‑form solutions, need strong independence assumptions.
Discriminative: model conditional \(p(y|\mathbf{x})\) or decision boundary directly, usually higher predictive accuracy.
⚠️ Common Misunderstandings
“More features always improve performance.” → High dimensionality can inflate variance; irrelevant/redundant features hurt linear & distance‑based methods.
“Regularization always harms accuracy.” – It reduces variance, often improving test performance, especially with limited data.
“k‑NN is non‑parametric, so no overfitting.” – With small \(k\) and noisy data, k‑NN can overfit; larger \(k\) increases bias.
“A higher training accuracy means a better model.” – Might indicate overfitting; always compare to validation/test error.
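The k-NN misunderstanding above is easy to demonstrate: with \(k=1\) and continuous features, every training point is its own nearest neighbor, so training accuracy is perfect even on deliberately noisy labels. A scikit-learn sketch (dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# flip_y injects label noise to make memorization harmful.
X, y = make_classification(n_samples=200, n_features=5, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# k = 1 memorizes the (noisy) training set: perfect training accuracy.
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
train_acc = knn1.score(X_tr, y_tr)     # 1.0 by construction

# A larger k averages over neighbors: higher bias, and typically
# better generalization when labels are noisy.
knn15 = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)
test_acc_1, test_acc_15 = knn1.score(X_te, y_te), knn15.score(X_te, y_te)
```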
🧠 Mental Models / Intuition
Bias–Variance Slider – Imagine a slider: left = high bias/low variance (simple line), right = low bias/high variance (wiggly curve). Your job is to stop near the sweet spot where the curve fits the underlying trend but not the noise.
Regularization as “tension” – Adding \(\lambda\) pulls model weights toward zero, smoothing predictions like a rubber band stretched over data points.
Hypothesis Space \(\mathcal{H}\) – Think of \(\mathcal{H}\) as the “toolbox” an algorithm can reach into; a bigger toolbox gives flexibility (low bias) but also more ways to pick the wrong tool (high variance).
🚩 Exceptions & Edge Cases
Noisy labels – Prefer higher‑bias models (e.g., linear SVM with strong regularization) and consider early stopping.
Very small datasets – Linear models with strong regularization or Naive Bayes often outperform complex models.
Highly imbalanced classes – Accuracy is misleading; use precision/recall, ROC‑AUC, or apply class weighting / resampling.
Heterogeneous feature types – Decision trees handle mixed numeric/categorical data natively; other algorithms need careful preprocessing.
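The imbalanced-classes trap in a few lines of numpy: a classifier that always predicts the majority class scores well on accuracy yet never finds a positive. The 95/5 split is illustrative:

```python
import numpy as np

# 95% negatives, 5% positives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)   # always predict the majority class

accuracy = np.mean(y_pred == y_true)        # 0.95 -- looks great
tp = np.sum((y_pred == 1) & (y_true == 1))
recall = tp / np.sum(y_true == 1)           # 0.0  -- the real story
```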
📍 When to Use Which
| Situation | Recommended Algorithm(s) | Reason |
|-----------|---------------------------|--------|
| Few samples, many features | Linear/Logistic regression with \(\ell_1\) regularization, Naive Bayes | High bias tolerates scarcity; regularization combats overfitting. |
| Complex non‑linear relationships, ample data | Neural networks, ensemble trees (random forest, boosting) | Low bias, high capacity; regularization via early stopping/boosting reduces variance. |
| Clear margin between classes, possibly high‑dimensional | SVM with appropriate kernel | Maximizes margin, works well with kernels for non‑linear separations. |
| Need interpretable rules | Decision tree, logistic regression | Provides explicit decision paths or coefficient interpretation. |
| Similarity‑based prediction (e.g., recommendation) | k‑NN | Non‑parametric, simple to implement; ensure proper scaling. |
| Probabilistic output required & independence approx. holds | Naive Bayes, LDA | Efficient, closed‑form solutions; good baseline. |
👀 Patterns to Recognize
“High variance → many parameters, flexible model, large training set needed.”
“Feature redundancy → linear or distance‑based models become unstable; look for regularization or PCA.”
“Noise in labels → training loss keeps decreasing while validation loss plateaus or rises (classic overfit curve).”
“Decision boundary looks linear in transformed space → consider kernel SVM.”
“Model predictions are over‑confident (probabilities near 0 or 1) → check calibration.”
🗂️ Exam Traps
Equating low bias with a good model – a model can have low bias but still overfit (and generalize poorly) if variance is high.
Choosing k‑NN with unscaled features – distance dominated by large‑scale variables → wrong neighbor selection.
Assuming “more layers = better NN” – without enough data or regularization, deeper nets just increase variance.
Mixing up loss functions – using squared error for classification leads to poor probability estimates; log‑loss is required for probabilistic classifiers.
Ignoring the regularization parameter – setting \(\lambda=0\) in SRM reduces to pure ERM → high risk of overfitting.
Selecting algorithm solely on training accuracy – exam questions often present a training‑set‑perfect model; the correct answer will emphasize validation performance or bias‑variance reasoning.
---
Use this guide for a quick “last‑minute” review – focus on the core concepts, the bias‑variance intuition, and the decision‑rules for algorithm choice. Good luck!