Unsupervised Learning Study Guide
📖 Core Concepts
Unsupervised learning – algorithms discover patterns without any labels.
Self‑supervision – creates its own training signal from the raw data (e.g., masking tokens in BERT).
Generative vs. discriminative – generative models learn the data distribution so they can produce or reconstruct data; discriminative models learn \(p(y \mid x)\) or a decision boundary for recognition/classification.
Latent variable model – observed data are generated from hidden (latent) variables (e.g., topics in a document).
Reconstruction error – the difference between input and output that drives learning in autoencoders.
Density estimation – learning the full probability distribution of the data, not just a conditional label distribution.
---
📌 Must Remember
k‑means objective: minimize within‑cluster variance
$$J = \sum_{i=1}^{k}\sum_{x \in C_i} \|x-\mu_i\|^{2}$$
EM algorithm – alternates E‑step (estimate hidden variables) and M‑step (maximize parameters).
Method of moments – matches empirical moments (mean, covariance) to their theoretical expressions to solve for parameters.
Hebbian rule: “neurons that fire together, wire together.”
SOM topology: nearby map units represent similar inputs → topographic ordering.
DBSCAN: clusters are dense regions; points in low‑density areas are labeled noise.
Isolation Forest: anomalies require fewer splits to isolate → shorter path length.
Autoencoder training: minimize reconstruction loss (e.g., mean‑squared error).
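The Hebbian rule above fits in a one-line update. A minimal sketch for a single linear neuron (function and variable names are illustrative):

```python
def hebbian_step(w, x, lr=0.1):
    """One Hebbian update: dw_i = lr * x_i * y, strengthening weights
    whose inputs co-occur with post-synaptic activity y."""
    y = sum(wi * xi for wi, xi in zip(w, x))   # post-synaptic activity
    return [wi + lr * xi * y for wi, xi in zip(w, x)]

w = [0.5, 0.1]
w = hebbian_step(w, [1.0, 0.0])   # only input 0 fires -> only w[0] strengthens
```

Note the update is purely local: each weight change depends only on its own input and the neuron's output, not on any error signal.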
---
🔄 Key Processes
k‑means clustering
1. Initialize \(k\) centroids.
2. Assign each point to the nearest centroid.
3. Re‑compute centroids as the mean of assigned points.
4. Repeat steps 2–3 until assignments stop changing.
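The loop above fits in a few lines of pure Python. A toy sketch on 2‑D points (function name and data are illustrative, not a production implementation):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm on points given as tuples of equal dimension."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # 1. initialize k centroids
    for _ in range(iters):
        # 2. assign each point to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # 3. recompute centroids as the mean of assigned points
        new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:                   # 4. assignments stable -> stop
            break
        centroids = new
    return centroids

data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers = kmeans(data, k=2)
```

On this toy data the two returned centroids settle at the means of the two well-separated groups.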
EM for mixture models
E‑step: Compute posterior probabilities \(p(z_i \mid x_i, \theta^{\text{old}})\).
M‑step: Update parameters \(\theta\) to maximize the expected complete‑data log‑likelihood.
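A minimal sketch of these two steps for a two‑component 1‑D Gaussian mixture (initialization and data are illustrative; a real implementation would work in log space for numerical stability):

```python
import math

def em_gmm_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    xs = sorted(xs)
    mu = [xs[0], xs[-1]]      # crude init: extremes of the data
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            dens = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate weights, means, variances from responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk + 1e-6
    return pi, mu, var

data = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
pi, mu, var = em_gmm_1d(data)
```

The E-step produces soft assignments (responsibilities), which is what distinguishes mixture-model EM from the hard assignments of k‑means.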
Isolation Forest building
Randomly select a feature and a split value.
Recursively partition until each point is isolated or a depth limit is reached.
Anomaly score is derived from the average path length across many trees (shorter ⇒ more anomalous).
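A stripped-down 1‑D sketch of the idea (single feature, no subsampling or score normalization; names are illustrative):

```python
import random

def path_length(x, points, rng, depth=0, limit=10):
    """Number of random splits needed to isolate x among points."""
    if len(points) <= 1 or depth >= limit:
        return depth
    lo, hi = min(points), max(points)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)                     # random split value
    same_side = [p for p in points if (p < split) == (x < split)]
    return path_length(x, same_side, rng, depth + 1, limit)

def anomaly_depth(x, data, trees=200, seed=0):
    """Average path length over many random trees; shorter => more anomalous."""
    rng = random.Random(seed)
    return sum(path_length(x, data, rng) for _ in range(trees)) / trees

rng_data = random.Random(1)
data = [rng_data.gauss(0, 1) for _ in range(100)]   # bulk of normal points
d_outlier = anomaly_depth(10.0, data)               # far from the bulk
d_inlier = anomaly_depth(0.0, data)                 # in the dense core
```

The outlier lands alone after few splits, so its average depth is smaller than the inlier's, which is exactly the signal the anomaly score captures.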
Autoencoder training
Encode: \(h = f_{\text{enc}}(x)\).
Decode: \(\hat{x}=f_{\text{dec}}(h)\).
Compute loss \(L = \|x-\hat{x}\|^{2}\) and back‑propagate to update weights.
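A minimal sketch of this loop for a tied-weight linear autoencoder on 2‑D inputs, with the gradient of the reconstruction loss derived by hand (data and hyperparameters are illustrative):

```python
import random

def train_linear_ae(data, steps=1000, lr=0.005, seed=0):
    """Tied-weight linear autoencoder: code h = w.x, reconstruction x_hat = h*w."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(2)]
    for _ in range(steps):
        for x in data:
            h = w[0] * x[0] + w[1] * x[1]              # encode
            err = [h * w[i] - x[i] for i in range(2)]  # x_hat - x
            ew = err[0] * w[0] + err[1] * w[1]
            # gradient of ||x - x_hat||^2 w.r.t. w (tied weights):
            # dL/dw_i = 2 * (err_i * h + (err . w) * x_i)
            for i in range(2):
                w[i] -= lr * 2 * (err[i] * h + ew * x[i])
    return w

def recon_loss(w, data):
    total = 0.0
    for x in data:
        h = w[0] * x[0] + w[1] * x[1]
        total += (x[0] - h * w[0]) ** 2 + (x[1] - h * w[1]) ** 2
    return total / len(data)

# points on the line y = 2x: the 2-D data has 1-D latent structure
data = [(-1.0, -2.0), (-0.5, -1.0), (0.5, 1.0), (1.0, 2.0)]
w = train_linear_ae(data)
```

Because the data lies on a line, a single latent dimension suffices and the reconstruction loss is driven near zero; the learned weight vector aligns with the data's principal direction.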
Method‑of‑moments estimation
Compute sample moments (e.g., \(\hat{\mu} = \frac{1}{n}\sum_{i} x_i\)).
Set them equal to the theoretical moments expressed in terms of unknown parameters and solve.
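A worked sketch for a Gamma distribution, matching the first two moments (true parameters and sample size are illustrative):

```python
import random

def mom_gamma(xs):
    """Gamma(shape k, scale theta) has mean = k*theta and var = k*theta**2.
    Setting sample mean/variance equal to these and solving gives
    theta_hat = var/mean and k_hat = mean**2/var."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean ** 2 / var, var / mean   # (k_hat, theta_hat)

rng = random.Random(0)
sample = [rng.gammavariate(3.0, 2.0) for _ in range(20000)]  # true k=3, theta=2
k_hat, theta_hat = mom_gamma(sample)
```

No iteration or likelihood is involved: two sample moments, two equations, two unknowns solved in closed form.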
---
🔍 Key Comparisons
k‑means vs. DBSCAN
k‑means: assumes spherical clusters, needs k, sensitive to outliers.
DBSCAN: finds arbitrarily shaped clusters, no k, automatically labels noise.
EM vs. Method of Moments
EM: iterative, may get stuck in local optima, guarantees non‑decreasing likelihood.
Method of Moments: closed‑form (when solvable), no likelihood, can be less statistically efficient.
Hebbian learning vs. Backpropagation
Hebbian: unsupervised, strengthens connections only on co‑activation.
Backprop: supervised (or reconstruction‑based), uses error gradients to update all weights.
Self‑Organizing Map vs. Autoencoder
SOM: produces a topographic map, mainly for visualization and clustering.
Autoencoder: learns a compact latent code for reconstruction or downstream tasks.
---
⚠️ Common Misunderstandings
“Unsupervised = no learning signal.”
Reality: self‑supervision, reconstruction loss, or density estimation provide strong learning signals.
“Clustering always yields meaningful groups.”
Reality: quality depends on feature space; poorly chosen features → arbitrary clusters.
“EM always finds the global optimum.”
Reality: EM can converge to local maxima; initialization matters.
“Higher dimensionality always hurts performance.”
Reality: Dimensionality reduction (PCA, SVD) can improve downstream tasks by removing noise.
---
🧠 Mental Models / Intuition
Reconstruction as a “copy‑test”: if a network can perfectly rebuild its input, it must have captured the underlying structure.
Density ≈ “landscape”: think of data points as marbles rolling on a terrain; dense valleys = clusters, high plateaus = anomalies.
EM as “expectation‑then‑update”: imagine guessing hidden labels (E‑step) and then polishing the model based on those guesses (M‑step).
---
🚩 Exceptions & Edge Cases
k‑means fails on clusters with different sizes/variances or non‑convex shapes.
DBSCAN struggles when clusters have varying densities; parameters \(\varepsilon\) and MinPts become critical.
Method of moments may be ill‑posed if moments do not uniquely identify parameters (e.g., symmetric distributions).
SOM requires careful grid size selection; too small → loss of detail, too large → over‑fragmentation.
---
📍 When to Use Which
| Situation | Preferred Method | Why |
|-----------|------------------|-----|
| Quick, large‑scale segmentation with roughly spherical clusters | k‑means | Linear time, easy to implement |
| Data with arbitrary shape and unknown number of clusters | DBSCAN / OPTICS | Density‑based, handles noise |
| Need probabilistic interpretation of clusters | Mixture models + EM | Gives soft assignments and likelihood |
| Want a compact representation for downstream supervised task | Autoencoder (or other latent‑variable model) | Learns meaningful embeddings |
| Detect rare, anomalous events in streaming logs | Isolation Forest or LOF | Scales well, unsupervised |
| Estimate parameters of a model without iterative likelihood | Method of Moments | Closed‑form when applicable |
| Visualize high‑dimensional data preserving topology | Self‑Organizing Map | Maps similarity to 2‑D layout |
| Pre‑train large language or vision models | Self‑supervised reconstruction / generative pre‑training | Leverages massive unlabeled data |
---
👀 Patterns to Recognize
“Dense‑core + sparse‑border” → likely a DBSCAN‑style cluster.
“Sharp drop in within‑cluster variance after adding a centroid” → elbow point for choosing k in k‑means.
“Shorter isolation tree path length” → potential anomaly (Isolation Forest).
“Reconstruction loss plateaus quickly” → autoencoder has captured dominant structure; further depth may overfit.
“EM log‑likelihood stops increasing” → convergence (but verify not a poor local optimum).
---
🗂️ Exam Traps
Choosing k‑means for non‑convex clusters – the answer will look neat but will be penalized because k‑means assumes spherical shapes.
Stating EM always converges to the true parameters – EM only guarantees non‑decreasing likelihood, not global optimality.
Confusing “density estimation” with “conditional classification” – density estimation models \(p(x)\), not \(p(y|x)\).
Mixing up Hebbian learning with backpropagation – Hebbian is unsupervised and local; backprop uses global error signals.
Assuming Isolation Forest needs labeled anomalies – it is completely unsupervised; the trap is to think labels are required.
---