RemNote Community

Unsupervised learning - Core Techniques and Algorithms

Understand core unsupervised learning techniques such as Hebbian learning, self‑organizing maps, clustering, anomaly detection, and latent‑variable methods like EM and the method of moments.


Summary

Unsupervised Learning in Neural Networks

Introduction

Unsupervised learning represents a fundamentally different approach to training neural networks compared to supervised learning. Rather than learning to predict labeled targets, unsupervised models learn the underlying structure of data by finding patterns, reducing dimensionality, or identifying groups without explicit labels. This guide covers the classical principles and algorithms that form the foundation of modern unsupervised learning. Unsupervised tasks (imagining pictures, generating videos, modeling language) differ from supervised tasks (recognizing images, answering questions, analyzing sentiment): unsupervised methods work directly with raw data to discover its inherent organization.

Classical Foundations: Hebbian Learning

Hebbian learning is one of the oldest and most intuitive principles in neural networks. Proposed by Donald Hebb, it states that neurons that fire together wire together: when two neurons are active simultaneously, the connection between them is strengthened. Mathematically, a simplified Hebbian learning rule can be expressed as:

$$\Delta w_{ij} = \eta \cdot a_i \cdot a_j$$

where $\Delta w_{ij}$ is the change in weight between neurons $i$ and $j$, $\eta$ is a learning rate, and $a_i$ and $a_j$ are the activation values of the two neurons. The key insight is that Hebbian learning is unsupervised: it requires no labeled data, only the natural correlations between neural activations. This makes it biologically plausible and computationally simple. However, Hebbian learning has limitations: weights can grow unbounded, and the learning is entirely local, based only on the coincidence of firing.

Self-Organizing Maps (SOMs)

The Self-Organizing Map is a classical neural network architecture that learns to create a topographic map of the input space.
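The Hebbian rule above can be sketched in a few lines of NumPy. This is a toy illustration: the activation vectors, learning rate, and network size are invented for the example.

```python
import numpy as np

eta = 0.1                             # learning rate
a_pre = np.array([1.0, 0.0, 1.0])     # presynaptic activations (assumed)
a_post = np.array([0.0, 1.0])         # postsynaptic activations (assumed)

W = np.zeros((a_post.size, a_pre.size))  # weight matrix, entry w_ij

def hebbian_step(W, a_post, a_pre, eta):
    """Outer-product Hebbian update: strengthen co-active pairs only."""
    return W + eta * np.outer(a_post, a_pre)

for _ in range(5):
    W = hebbian_step(W, a_post, a_pre, eta)
```

Only the weights between simultaneously active units grow; repeated application also shows the unbounded-growth problem mentioned above, since nothing ever shrinks a weight.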
In a trained SOM, the spatial organization of neurons in the map reflects the organization of features in the data: inputs with similar properties activate nearby neurons. SOMs are particularly useful for visualization because they reduce high-dimensional data to a (typically two-dimensional) grid while preserving the neighborhood structure of the data. The key mechanism has two parts:

Competitive learning: when an input arrives, the neuron whose weights are most similar to the input (the "winner") is selected.
Cooperative learning: not only the winner but also its neighbors in the map have their weights updated to become more similar to the input.

This creates a self-organizing effect in which the map gradually develops a structure that reflects the input distribution. Distant regions of the map come to represent dissimilar inputs, while nearby regions represent similar ones.

Training Objectives in Unsupervised Learning

Understanding Reconstruction Error

Unlike supervised learning, which minimizes prediction error on labeled targets, unsupervised neural networks typically minimize reconstruction error. The network learns to:

Encode input data into a compressed internal representation
Decode that representation back to reconstruct the original input
Minimize the difference between the original and reconstructed data

This reconstruction objective forces the network to capture the essential structure of the data in its hidden representations.

Learning Algorithms for Unsupervised Networks

Several fundamental algorithms power unsupervised learning. Understanding them is important because they represent different philosophical approaches to learning model parameters.

Variational Inference

Variational inference approaches unsupervised learning from a Bayesian perspective. Rather than finding a single best set of parameters, it approximates the full probability distribution over possible parameters and latent variables.
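A single SOM training step (competitive winner selection plus cooperative neighborhood update) might look like the following sketch, assuming a 5x5 map, 3-dimensional inputs, and a Gaussian neighborhood function; all sizes and constants here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
grid_h, grid_w, dim = 5, 5, 3
weights = rng.random((grid_h, grid_w, dim))   # one weight vector per map neuron
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1)  # grid position of each neuron

def som_step(weights, x, lr=0.5, sigma=1.0):
    # Competitive step: find the best-matching unit (the "winner").
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Cooperative step: the winner and its grid neighbors move toward x,
    # scaled by a Gaussian of their distance to the winner on the grid.
    grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
    h = np.exp(-grid_d2 / (2 * sigma ** 2))[..., None]
    return weights + lr * h * (x - weights)

x = np.array([0.2, 0.8, 0.5])
before = np.linalg.norm(weights - x, axis=-1).min()
for _ in range(20):
    weights = som_step(weights, x)
after = np.linalg.norm(weights - x, axis=-1).min()
```

Repeating the step pulls a patch of the map toward the input, which is the mechanism behind the topographic organization described above.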
Variational inference is useful when you want to quantify uncertainty about what the model has learned.

Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation finds parameters that maximize the probability of observing the actual data. Mathematically, it seeks to maximize:

$$\mathcal{L}(\theta) = P(X \mid \theta)$$

where $X$ is the observed data and $\theta$ represents the model parameters. MLE is intuitive: it finds the parameters that make the observed data as probable as possible.

Maximum A Posteriori (MAP) Estimation

Maximum a posteriori estimation extends MLE by incorporating a prior belief about the parameters. Instead of maximizing only the likelihood of the data, MAP maximizes:

$$P(\theta \mid X) \propto P(X \mid \theta) \cdot P(\theta)$$

This allows domain knowledge to be incorporated through the prior $P(\theta)$, making MAP a middle ground between full Bayesian inference and point-estimate methods like MLE.

Gibbs Sampling

Gibbs sampling is a Markov chain Monte Carlo technique used primarily with energy-based models such as Boltzmann machines. Rather than finding a single parameter value, it draws samples from the posterior distribution by iteratively sampling each variable conditioned on the current values of all the others. This is powerful for complex models where direct optimization is intractable.

Backpropagation of Reconstruction Error

In autoencoders and similar architectures, standard backpropagation is used to minimize reconstruction error. The error signal flows backward through the network: the difference between the original and reconstructed input is computed, and gradients are used to update the weights to reduce this error.

Clustering Algorithms

Clustering is the task of grouping similar data points together without labeled information. Several fundamental approaches exist.

K-Means Clustering

K-means is perhaps the most widely used clustering algorithm.
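As a worked illustration of the MLE/MAP contrast above, consider estimating a coin's heads probability $\theta$ from 10 flips. The Beta prior and its pseudo-counts are assumptions chosen for this example; the posterior-mode formula is the standard closed form for a Beta-Bernoulli model.

```python
# Observed data: 7 heads in 10 flips (invented for the example).
heads, flips = 7, 10

# MLE: the theta that maximizes P(X | theta) is just the sample frequency.
theta_mle = heads / flips

# MAP with a Beta(a, b) prior (assumed a = b = 3, pulling theta toward 0.5):
# the posterior is Beta(a + heads, b + tails), whose mode is below.
a, b = 3, 3
theta_map = (heads + a - 1) / (flips + a + b - 2)
```

The prior shrinks the MAP estimate toward 0.5 relative to the MLE, which is exactly the "prior belief as middle ground" behavior described above.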
K-means partitions the data into exactly $k$ clusters by:

1. Initializing $k$ random cluster centers
2. Assigning each point to the nearest center
3. Updating each center to be the mean of its assigned points
4. Repeating steps 2-3 until convergence

K-means minimizes within-cluster variance (the sum of squared distances from points to their cluster centers). The algorithm is simple and efficient, but it requires specifying $k$ in advance and can converge to a local optimum depending on initialization.

Hierarchical Clustering

Hierarchical clustering builds a tree (dendrogram) of clusters by recursively merging or splitting clusters. There are two main variants:

Agglomerative: start with each point as its own cluster, then repeatedly merge the two closest clusters.
Divisive: start with all points in one cluster, then recursively split clusters.

Hierarchical clustering has the advantage that you don't need to specify the number of clusters in advance: you can cut the tree at any level. However, it is computationally more expensive than K-means.

Mixture Models

Mixture models assume that the data are generated from a mixture of probability distributions. For example, the data might come from a weighted combination of several Gaussian distributions:

$$P(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where $\pi_k$ is the weight of the $k$-th component and $\mu_k$ and $\Sigma_k$ are its mean and covariance. Unlike K-means, which makes hard assignments, mixture models provide soft assignments: each point has a probability of belonging to each cluster. Learning mixture models typically uses the Expectation-Maximization (EM) algorithm, discussed below.

Density-Based Clustering: DBSCAN and OPTICS

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters as connected dense regions in the data space. It does not require specifying $k$ in advance, and it naturally identifies outliers as points in sparse regions. However, DBSCAN struggles when clusters have varying densities.
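The four K-means steps listed earlier can be sketched from scratch. The two-blob dataset and the fixed iteration count (in place of a convergence test) are simplifications invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (50, 2)),     # blob near (0, 0)
                  rng.normal(5, 0.3, (50, 2))])    # blob near (5, 5)

def kmeans(X, k, iters=20, seed=0):
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]      # 1. random init from data
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)                        # 2. assign to nearest center
        centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(k)])          # 3. recompute cluster means
    return centers, labels                               # 4. (fixed iteration budget)

centers, labels = kmeans(data, k=2)
```

Each assign-then-update pass can only lower the within-cluster variance, which is why the loop settles, though possibly at a local optimum as noted above.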
OPTICS (Ordering Points To Identify the Clustering Structure) extends DBSCAN by ordering points in a way that reveals cluster structure at multiple density levels. Rather than producing a single clustering, OPTICS provides a hierarchical view of how the cluster structure changes with the density threshold.

Anomaly Detection Methods

Anomaly detection identifies unusual or outlying points that do not fit the normal pattern of the data.

Local Outlier Factor (LOF)

The Local Outlier Factor measures how isolated a point is relative to its local neighborhood. The key insight is that outliers are often points whose local density is much lower than that of their neighbors. LOF computes a score for each point by comparing the density of its $k$ nearest neighbors to its own density; points with high LOF scores are anomalies.

Isolation Forest

Isolation Forest uses a different philosophy: anomalies are points that are easier to isolate than normal points. The algorithm recursively partitions the data space using random splits. Normal points require many splits to isolate, while anomalies can be isolated with fewer splits. This intuition makes Isolation Forest particularly efficient for high-dimensional data.

Latent Variable Learning

Latent variable models assume that the observed data $X$ were generated from unobserved (latent) variables $Z$. Learning involves discovering what $Z$ must be and what parameters govern the relationship between $Z$ and $X$.

Expectation-Maximization (EM)

The Expectation-Maximization algorithm iteratively estimates latent variables and model parameters through two alternating steps:

E-step (Expectation): given the current parameter estimates, compute the expected values of the latent variables.
M-step (Maximization): given the expected latent variables, optimize the parameters to maximize the likelihood.

EM is guaranteed never to decrease the likelihood from one iteration to the next, though it may converge to a local optimum.
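The E-step/M-step loop above can be sketched for a two-component one-dimensional Gaussian mixture. The synthetic data and initial parameter guesses are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: two well-separated Gaussian components.
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Deliberately rough initial guesses.
mu, sd, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(50):
    # E-step: soft responsibility of each component for each point.
    dens = pi * normal_pdf(x[:, None], mu, sd)          # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the soft assignments.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)
```

The responsibilities are exactly the soft assignments of the mixture-model section, and each pass through the loop is one E-step followed by one M-step.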
EM is particularly valuable for mixture models, factor analysis, and other latent variable models where maximizing the likelihood directly is intractable.

Method of Moments

The method of moments takes a fundamentally different approach: it relates unknown model parameters directly to statistical moments of the observed data.

The first-order moment is the mean vector $\mathbb{E}[X]$.
The second-order moment $\mathbb{E}[XX^T]$ equals the covariance matrix when the mean is zero.

By expressing model parameters in terms of these moments, you can solve for the parameters directly, without iteration. This has a key advantage: the method of moments avoids local optima, since the solution is determined once the sample moments are computed. The tradeoff is that it may be less statistically efficient (requiring more data) than maximum likelihood methods, and it requires that the model's parameters can be expressed in terms of moments.

Blind Signal Separation Techniques

Blind signal separation refers to extracting independent signals from mixed observations without knowing the mixing process. Several techniques accomplish this:

Principal Component Analysis (PCA)

PCA finds the directions of maximum variance in the data. It identifies a set of orthogonal directions (principal components) such that projecting the data onto them captures as much variance as possible. This is useful for dimensionality reduction and denoising.

Independent Component Analysis (ICA)

ICA extends PCA by finding components that are not merely uncorrelated but statistically independent. This is particularly powerful for problems like separating audio sources mixed in a room: the original sources are independent even though their mixtures are correlated.

Non-Negative Matrix Factorization (NMF)

NMF decomposes the data into non-negative factors. This constraint is meaningful when the data are naturally non-negative (like image pixel intensities or word counts) and you want interpretable components that can be meaningfully added together.
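A minimal NMF sketch can be written with the classic multiplicative updates of Lee and Seung, one common fitting scheme among several; the rank and the synthetic non-negative matrix are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic non-negative data of exact rank 2.
V = rng.random((20, 2)) @ rng.random((2, 30))

r = 2                              # assumed factorization rank
W = rng.random((20, r)) + 0.1      # positive init keeps updates well-defined
H = rng.random((r, 30)) + 0.1
eps = 1e-9                         # guards against division by zero

for _ in range(500):
    # Multiplicative updates: ratios of non-negative terms, so W and H
    # can never become negative.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Because the updates multiply by non-negative ratios, non-negativity is preserved automatically, which is the interpretability constraint the section above describes.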
Singular Value Decomposition (SVD)

SVD decomposes a data matrix into three factors: $X = U \Sigma V^T$. Truncating the decomposition gives the optimal low-rank approximation of the data, and SVD is the mathematical foundation for PCA.

Note on comparing EM and the method of moments: while EM can become trapped in local optima and has no guarantee of recovering the true parameters, it often achieves better statistical efficiency and handles complex models more flexibly. The method of moments, conversely, provides a direct solution without iteration, avoiding local optima entirely, but may require stronger assumptions about what the moments reveal about the parameters.

Summary: unsupervised learning encompasses a rich set of classical principles (Hebbian learning, SOMs) and modern algorithms (clustering, anomaly detection, latent variable learning) that discover structure in unlabeled data. These methods differ in their assumptions about data generation, their computational cost, and the types of structure they uncover. Understanding when and how to apply each is essential for working with real-world data.
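To tie SVD back to PCA as a closing sketch, the following computes principal directions, explained-variance fractions, and projections from the SVD of centered data; the nearly one-dimensional toy dataset is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy data lying almost exactly along the direction (1, 0.5).
z = rng.normal(size=(500, 1))
X = np.hstack([z, 0.5 * z]) + 0.01 * rng.normal(size=(500, 2))

Xc = X - X.mean(axis=0)                  # center first; PCA uses centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S ** 2 / (S ** 2).sum()      # fraction of variance per component
scores = Xc @ Vt.T                       # projections onto principal axes
```

The rows of `Vt` are the principal components, and the squared singular values give each component's share of the variance, which is why SVD is described above as the foundation for PCA.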
Flashcards
What is the fundamental principle of Hebbian learning regarding neurons?
Neurons that fire together wire together.
On what basis does Hebbian learning reinforce connections between neurons?
The coincidence of action potentials.
What kind of structure does a Self-Organizing Map (SOM) create to represent inputs with similar properties?
A topographic map.
By what metric do unsupervised networks typically adjust their weights?
Reconstruction error.
How does hierarchical clustering build its structure of clusters?
By recursive merging or splitting to build a tree.
By what mathematical criterion does K-means partition data into clusters?
Minimizing within-cluster variance.
What is the underlying assumption of mixture models regarding data generation?
Data are generated from a mixture of probability distributions.
How does DBSCAN distinguish between clusters and noise?
It identifies dense regions of points and separates sparse noise.
In what way does the OPTICS algorithm extend the capabilities of DBSCAN?
It orders points to reveal cluster structure at multiple density levels.
What does the Local Outlier Factor (LOF) measure to identify anomalies?
The degree to which a point is isolated relative to its neighbors.
Why does an Isolation Forest partition data recursively?
To isolate anomalies more quickly than normal points.
What two components does the expectation–maximization (EM) algorithm iteratively estimate?
Hidden variables and model parameters.
What is the first-order moment in the Method of Moments?
The mean vector.
When the mean is zero, what does the second-order moment represent?
The covariance matrix.
What is a significant limitation of the expectation–maximization algorithm compared to the Method of Moments?
It can get trapped in local optima and lacks guaranteed convergence to true parameters.

Key Concepts
Neural Learning and Networks
Hebbian learning
Self‑organizing map
Clustering and Dimensionality Reduction
K‑means clustering
DBSCAN
Principal component analysis
Statistical Inference and Sampling
Variational inference
Gibbs sampling
Expectation–maximization algorithm
Isolation forest
Method of moments