Unsupervised Learning Study Guide
📖 Core Concepts
Unsupervised learning – algorithms discover patterns without any labels.
Self‑supervision – creates its own training signal from the raw data (e.g., masking tokens in BERT).
Generative vs. discriminative – generative models learn the data distribution so they can produce or reconstruct data; discriminative models learn \(p(y \mid x)\) or a decision boundary for recognition/classification.
Latent variable model – observed data are generated from hidden (latent) variables (e.g., topics in a document).
Reconstruction error – the difference between input and output that drives learning in autoencoders.
Density estimation – learning the full probability distribution of the data, not just a conditional label distribution.
---
📌 Must Remember
k‑means objective: minimize within‑cluster variance
$$J = \sum_{i=1}^{k}\sum_{x \in C_i} \|x-\mu_i\|^{2}$$
EM algorithm – alternates E‑step (estimate hidden variables) and M‑step (maximize parameters).
Method of moments – matches empirical moments (mean, covariance) to their theoretical expressions to solve for parameters.
Hebbian rule: “neurons that fire together, wire together.”
SOM topology: nearby map units represent similar inputs → topographic ordering.
DBSCAN: clusters are dense regions; points in low‑density areas are labeled noise.
Isolation Forest: anomalies require fewer splits to isolate → shorter path length.
Autoencoder training: minimize reconstruction loss (e.g., mean‑squared error).
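The Hebbian rule above fits in a one-line update. A minimal sketch for a single linear neuron (function and variable names are illustrative):

```python
def hebbian_step(w, x, lr=0.1):
    """One Hebbian update: dw_i = lr * x_i * y, strengthening weights
    whose inputs co-occur with post-synaptic activity y."""
    y = sum(wi * xi for wi, xi in zip(w, x))   # post-synaptic activity
    return [wi + lr * xi * y for wi, xi in zip(w, x)]

w = [0.5, 0.1]
w = hebbian_step(w, [1.0, 0.0])   # only input 0 fires -> only w[0] strengthens
```

Note the update is purely local: each weight change depends only on its own input and the neuron's output, not on any error signal.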
---
🔄 Key Processes
k‑means clustering
1. Initialize \(k\) centroids.
2. Assign each point to the nearest centroid.
3. Re‑compute centroids as the mean of assigned points.
4. Repeat steps 2–3 until assignments stop changing.
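The loop above fits in a few lines of pure Python. A toy sketch on 2‑D points (function name and data are illustrative, not a production implementation):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm on points given as tuples of equal dimension."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # 1. initialize k centroids
    for _ in range(iters):
        # 2. assign each point to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # 3. recompute centroids as the mean of assigned points
        new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:                   # 4. assignments stable -> stop
            break
        centroids = new
    return centroids

data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers = kmeans(data, k=2)
```

On this toy data the two returned centroids settle at the means of the two well-separated groups.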
EM for mixture models
E‑step: Compute posterior probabilities \(p(z_i \mid x_i, \theta^{\text{old}})\).
M‑step: Update parameters \(\theta\) to maximize the expected complete‑data log‑likelihood.
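A minimal sketch of these two steps for a two‑component 1‑D Gaussian mixture (initialization and data are illustrative; a real implementation would work in log space for numerical stability):

```python
import math

def em_gmm_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    xs = sorted(xs)
    mu = [xs[0], xs[-1]]      # crude init: extremes of the data
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            dens = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate weights, means, variances from responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk + 1e-6
    return pi, mu, var

data = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
pi, mu, var = em_gmm_1d(data)
```

The E-step produces soft assignments (responsibilities), which is what distinguishes mixture-model EM from the hard assignments of k‑means.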
Isolation Forest building
Randomly select a feature and a split value.
Recursively partition until each point is isolated or a depth limit is reached.
Anomaly score is derived from the average path length across many trees (shorter ⇒ more anomalous).
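A stripped-down 1‑D sketch of the idea (single feature, no subsampling or score normalization; names are illustrative):

```python
import random

def path_length(x, points, rng, depth=0, limit=10):
    """Number of random splits needed to isolate x among points."""
    if len(points) <= 1 or depth >= limit:
        return depth
    lo, hi = min(points), max(points)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)                     # random split value
    same_side = [p for p in points if (p < split) == (x < split)]
    return path_length(x, same_side, rng, depth + 1, limit)

def anomaly_depth(x, data, trees=200, seed=0):
    """Average path length over many random trees; shorter => more anomalous."""
    rng = random.Random(seed)
    return sum(path_length(x, data, rng) for _ in range(trees)) / trees

rng_data = random.Random(1)
data = [rng_data.gauss(0, 1) for _ in range(100)]   # bulk of normal points
d_outlier = anomaly_depth(10.0, data)               # far from the bulk
d_inlier = anomaly_depth(0.0, data)                 # in the dense core
```

The outlier lands alone after few splits, so its average depth is smaller than the inlier's, which is exactly the signal the anomaly score captures.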
Autoencoder training
Encode: \(h = f_{\text{enc}}(x)\).
Decode: \(\hat{x}=f_{\text{dec}}(h)\).
Compute loss \(L = \|x-\hat{x}\|^{2}\) and back‑propagate to update weights.
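A minimal sketch of this loop for a tied-weight linear autoencoder on 2‑D inputs, with the gradient of the reconstruction loss derived by hand (data and hyperparameters are illustrative):

```python
import random

def train_linear_ae(data, steps=1000, lr=0.005, seed=0):
    """Tied-weight linear autoencoder: code h = w.x, reconstruction x_hat = h*w."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(2)]
    for _ in range(steps):
        for x in data:
            h = w[0] * x[0] + w[1] * x[1]              # encode
            err = [h * w[i] - x[i] for i in range(2)]  # x_hat - x
            ew = err[0] * w[0] + err[1] * w[1]
            # gradient of ||x - x_hat||^2 w.r.t. w (tied weights):
            # dL/dw_i = 2 * (err_i * h + (err . w) * x_i)
            for i in range(2):
                w[i] -= lr * 2 * (err[i] * h + ew * x[i])
    return w

def recon_loss(w, data):
    total = 0.0
    for x in data:
        h = w[0] * x[0] + w[1] * x[1]
        total += (x[0] - h * w[0]) ** 2 + (x[1] - h * w[1]) ** 2
    return total / len(data)

# points on the line y = 2x: the 2-D data has 1-D latent structure
data = [(-1.0, -2.0), (-0.5, -1.0), (0.5, 1.0), (1.0, 2.0)]
w = train_linear_ae(data)
```

Because the data lies on a line, a single latent dimension suffices and the reconstruction loss is driven near zero; the learned weight vector aligns with the data's principal direction.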
Method‑of‑moments estimation
Compute sample moments (e.g., \(\hat{\mu} = \frac{1}{n}\sum_{i} x_i\)).
Set them equal to the theoretical moments expressed in terms of unknown parameters and solve.
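A worked sketch for a Gamma distribution, matching the first two moments (true parameters and sample size are illustrative):

```python
import random

def mom_gamma(xs):
    """Gamma(shape k, scale theta) has mean = k*theta and var = k*theta**2.
    Setting sample mean/variance equal to these and solving gives
    theta_hat = var/mean and k_hat = mean**2/var."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean ** 2 / var, var / mean   # (k_hat, theta_hat)

rng = random.Random(0)
sample = [rng.gammavariate(3.0, 2.0) for _ in range(20000)]  # true k=3, theta=2
k_hat, theta_hat = mom_gamma(sample)
```

No iteration or likelihood is involved: two sample moments, two equations, two unknowns solved in closed form.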
---
🔍 Key Comparisons
k‑means vs. DBSCAN
k‑means: assumes spherical clusters, needs k, sensitive to outliers.
DBSCAN: finds arbitrarily shaped clusters, no k, automatically labels noise.
EM vs. Method of Moments
EM: iterative, may get stuck in local optima, guarantees non‑decreasing likelihood.
Method of Moments: closed‑form (when solvable), no likelihood, can be less statistically efficient.
Hebbian learning vs. Backpropagation
Hebbian: unsupervised, strengthens connections only on co‑activation.
Backprop: supervised (or reconstruction‑based), uses error gradients to update all weights.
Self‑Organizing Map vs. Autoencoder
SOM: produces a topographic map, mainly for visualization and clustering.
Autoencoder: learns a compact latent code for reconstruction or downstream tasks.
---
⚠️ Common Misunderstandings
“Unsupervised = no learning signal.”
Reality: self‑supervision, reconstruction loss, or density estimation provide strong learning signals.
“Clustering always yields meaningful groups.”
Reality: quality depends on feature space; poorly chosen features → arbitrary clusters.
“EM always finds the global optimum.”
Reality: EM can converge to local maxima; initialization matters.
“Higher dimensionality always hurts performance.”
Reality: Dimensionality reduction (PCA, SVD) can improve downstream tasks by removing noise.
---
🧠 Mental Models / Intuition
Reconstruction as a “copy‑test”: if a network can perfectly rebuild its input, it must have captured the underlying structure.
Density ≈ “landscape”: think of data points as marbles rolling on a terrain; dense valleys = clusters, high plateaus = anomalies.
EM as “expectation‑then‑update”: imagine guessing hidden labels (E‑step) and then polishing the model based on those guesses (M‑step).
---
🚩 Exceptions & Edge Cases
k‑means fails on clusters with different sizes/variances or non‑convex shapes.
DBSCAN struggles when clusters have varying densities; parameters \(\varepsilon\) and MinPts become critical.
Method of moments may be ill‑posed if moments do not uniquely identify parameters (e.g., symmetric distributions).
SOM requires careful grid size selection; too small → loss of detail, too large → over‑fragmentation.
---
📍 When to Use Which
| Situation | Preferred Method | Why |
|-----------|------------------|-----|
| Quick, large‑scale segmentation with roughly spherical clusters | k‑means | Linear time, easy to implement |
| Data with arbitrary shape and unknown number of clusters | DBSCAN / OPTICS | Density‑based, handles noise |
| Need probabilistic interpretation of clusters | Mixture models + EM | Gives soft assignments and likelihood |
| Want a compact representation for downstream supervised task | Autoencoder (or other latent‑variable model) | Learns meaningful embeddings |
| Detect rare, anomalous events in streaming logs | Isolation Forest or LOF | Scales well, unsupervised |
| Estimate parameters of a model without iterative likelihood | Method of Moments | Closed‑form when applicable |
| Visualize high‑dimensional data preserving topology | Self‑Organizing Map | Maps similarity to 2‑D layout |
| Pre‑train large language or vision models | Self‑supervised reconstruction / generative pre‑training | Leverages massive unlabeled data |
---
👀 Patterns to Recognize
“Dense‑core + sparse‑border” → likely a DBSCAN‑style cluster.
“Sharp drop in within‑cluster variance after adding a centroid” → elbow point for choosing k in k‑means.
“Shorter isolation tree path length” → potential anomaly (Isolation Forest).
“Reconstruction loss plateaus quickly” → autoencoder has captured dominant structure; further depth may overfit.
“EM log‑likelihood stops increasing” → convergence (but verify not a poor local optimum).
---
🗂️ Exam Traps
Choosing k‑means for non‑convex clusters – the answer will look neat but will be penalized because k‑means assumes spherical shapes.
Stating EM always converges to the true parameters – EM only guarantees non‑decreasing likelihood, not global optimality.
Confusing “density estimation” with “conditional classification” – density estimation models \(p(x)\), not \(p(y|x)\).
Mixing up Hebbian learning with backpropagation – Hebbian is unsupervised and local; backprop uses global error signals.
Assuming Isolation Forest needs labeled anomalies – it is completely unsupervised; the trap is to think labels are required.
---