RemNote Community

Introduction to Unsupervised Learning

Understand the basics of unsupervised learning, key clustering and dimensionality‑reduction techniques, and their practical applications versus supervised methods.


Summary

Fundamentals of Unsupervised Learning

What is Unsupervised Learning?

Unsupervised learning represents a distinct approach to machine learning where we work with data that has no explicit labels or target outcomes. Unlike supervised learning, where we're given input-output pairs and asked to learn the relationship between them, unsupervised learning algorithms receive only raw data and must discover structure, patterns, and relationships entirely on their own.

Think of it this way: in supervised learning, we're teaching a student by showing them problems and their correct solutions. In unsupervised learning, we're handing a student a collection of books and asking them to organize the books into meaningful categories without any guidance on how the categories should be defined.

Why Use Unsupervised Learning?

The primary motivation for unsupervised learning is data exploration. Real-world data often contains hidden relationships and structure that are not immediately obvious from casual observation. Unsupervised learning helps reveal these patterns, which can then be used for:

- Understanding data structure before building more complex models
- Feature engineering to create better inputs for supervised learning
- Anomaly detection by identifying points that don't fit normal patterns
- Data compression by finding compact representations of high-dimensional information

The accompanying diagram contrasts the two paradigms: unsupervised methods like "imagine pictures," "generate videos," and "model languages" work directly with raw inputs, while supervised methods like "question & answer" and "analyze sentiments" require labeled training pairs.

Clustering Techniques

The Clustering Problem

Clustering is the most common unsupervised learning task. The goal is to partition data into groups of similar items, where items within the same group are more similar to each other than to items in other groups.
Crucially, we don't define these groups in advance; the algorithm discovers them.

A key challenge in clustering is that there is no single "correct" answer. If you ask three different domain experts to cluster the same dataset, they might create three different groupings, and all could be valid depending on the context and on which aspects of similarity matter most. This is fundamentally different from supervised learning.

k-Means Clustering

k-Means is one of the simplest and most widely used clustering algorithms. The core idea is straightforward:

1. Initialize: choose $k$ cluster centers (called centroids) randomly from the data
2. Assign: assign each data point to the nearest centroid
3. Update: move each centroid to the center (mean) of all points assigned to it
4. Repeat: iterate steps 2 and 3 until the centroids stop moving significantly

The algorithm seeks to minimize the total distance from each point to its assigned centroid, which encourages the formation of compact, roughly spherical clusters.

Key limitation: k-Means assumes clusters are roughly spherical in shape. If your true clusters have elongated or irregular shapes, k-Means may not discover them well. Additionally, you must specify $k$ in advance; the algorithm has no built-in way to determine the "right" number of clusters.

Hierarchical Clustering

Hierarchical clustering takes a different approach: it builds a tree of nested clusters called a dendrogram. Rather than producing a single partition of the data, hierarchical clustering shows how data can be grouped at multiple levels of granularity.
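The four k-Means steps above can be sketched in a few lines of NumPy. This is a minimal illustration (the two-blob dataset and k = 2 are made up for the example), not a production implementation:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs: k-Means should recover them
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
labels, centroids = k_means(X, k=2)
```

Note that the objective here is implicit: assigning each point to its nearest centroid and moving centroids to cluster means both reduce the total point-to-centroid distance, which is why the iteration converges.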
There are two main strategies:

- Agglomerative (bottom-up): start with each point in its own cluster, then repeatedly merge the two most similar clusters until a single cluster remains
- Divisive (top-down): start with all data in one cluster, then repeatedly split clusters until each point stands alone

Hierarchical clustering is flexible because you can "cut" the dendrogram at different heights to get different numbers of clusters. However, it's more computationally expensive than k-Means and requires defining how to measure similarity between entire clusters, which is not always obvious.

Evaluating Clustering Results

Since clustering has no objectively "correct" answer, evaluation depends on whether the discovered groups align with domain knowledge and business objectives. Questions to ask include:

- Do the clusters make intuitive sense to domain experts?
- Do the clusters help solve the downstream problem we care about (e.g., customer segmentation, document organization)?
- Are the clusters stable (would the algorithm find similar clusters on slightly different data)?
- Are there natural "gaps" in the data that suggest clusters truly exist?

Some algorithms compute internal metrics like the silhouette score (how well-separated the clusters are) or the Davies-Bouldin index (a ratio of within-cluster to between-cluster distances), but these metrics measure statistical properties, not whether the clusters are useful.

Dimensionality Reduction Techniques

Why Reduce Dimensions?
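The silhouette score mentioned above can be computed directly from pairwise distances. A minimal NumPy sketch (the two-group data is illustrative):

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette s = (b - a) / max(a, b), where a is a point's mean
    distance to its own cluster and b is its mean distance to the
    nearest other cluster. Values near +1 mean well-separated clusters."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = len(X)
    scores = []
    for i in range(n):
        # a: mean distance to other points in the same cluster
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()
        # b: mean distance to the closest other cluster
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated groups should score close to +1
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)]) \
    + 0.1 * np.arange(10)[:, None]
labels = np.array([0] * 5 + [1] * 5)
score = silhouette_score(X, labels)
```

As the text notes, a high score only says the clusters are geometrically well separated; it says nothing about whether they are meaningful for the task.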
High-dimensional data (data with many features) presents several challenges:

- Visualization is impossible: we can only naturally visualize 2D or 3D data, making it hard to understand high-dimensional patterns
- Noise becomes pronounced: in high dimensions, the "curse of dimensionality" means distances become less meaningful, and noise overwhelms signal
- Computational burden: algorithms slow down when processing many features
- Redundancy: real-world data often has correlated features that represent the same underlying information

Dimensionality reduction addresses these problems by finding a compact, lower-dimensional representation that preserves the most important information from the original data. The goal is to remove noise and redundancy while retaining the essential structure.

Principal Component Analysis (PCA)

Principal Component Analysis is perhaps the most fundamental dimensionality reduction technique. Here's the key insight: PCA searches for new features that capture the maximum variance in the data. These new features are:

- Orthogonal: they are perpendicular to each other (uncorrelated)
- Ordered by importance: the first principal component captures the most variance, the second captures the most remaining variance, and so on

To understand this intuitively, imagine a scatter plot of data points. If the points are elongated in one direction, PCA rotates the coordinate system so the first axis points along the direction of maximum spread. The second axis points along the next direction of maximum spread (perpendicular to the first), and so on.

By keeping only the first few principal components, we dramatically reduce the number of features while retaining most of the meaningful variation in the data. The components we drop are assumed to capture mostly noise.

When it works well: PCA is excellent when the data's meaningful variation is concentrated in a few directions (i.e., it lies on a low-dimensional structure within the high-dimensional space).
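A minimal NumPy sketch of PCA via the singular value decomposition; the elongated 2-D dataset is illustrative:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components."""
    # Center the data: PCA directions are defined relative to the mean
    Xc = X - X.mean(axis=0)
    # Rows of Vt are orthogonal directions, ordered by variance captured
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    comps = Vt[:n_components]            # (n_components, n_features)
    explained_var = S**2 / (len(X) - 1)  # variance along each direction
    return Xc @ comps.T, comps, explained_var

# Points elongated along the x-axis: the first component should align with it
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
Z, comps, var = pca(X, n_components=1)
```

Using the SVD of the centered data rather than an explicit covariance matrix is the standard numerically stable route; both give the same components.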
When it struggles: PCA assumes the important variation is captured by linear combinations of the original features. If the true structure is highly nonlinear, PCA may require many components to capture it.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

While PCA aims to preserve global structure, t-SNE focuses on preserving local structure. It converts similarities between high-dimensional data points into distances in a low-dimensional space. The algorithm works by:

1. Computing which points are "neighbors" in high-dimensional space (nearby points are similar)
2. Creating a low-dimensional representation where those same neighbor relationships hold
3. Using a heavy-tailed probability distribution that emphasizes local clustering while allowing distant points to spread apart

Key strength: t-SNE creates visually striking maps where clusters and local patterns become immediately apparent, making it excellent for exploratory visualization.

Key limitation: t-SNE does not preserve global distances well. The distances between separate clusters in a t-SNE plot are not meaningful; only the local clustering matters. Additionally, t-SNE is computationally expensive and can be sensitive to parameter choices.

Benefits of Dimensionality Reduction

Visualization: reducing high-dimensional data to 2D or 3D allows us to visualize complex patterns and relationships that would otherwise remain invisible. This is invaluable for exploratory data analysis.

Noise reduction and speed: by removing redundant and noisy features, downstream machine learning algorithms run faster and often produce better results. A supervised classifier trained on compressed features often performs better than one trained on all original features, because irrelevant information is eliminated.

Unsupervised vs. Supervised Learning

Key Differences

Understanding how unsupervised learning differs from supervised learning is essential:

| Aspect | Supervised Learning | Unsupervised Learning |
|--------|---------------------|-----------------------|
| Training Data | Input-output pairs $(x, y)$ where $y$ is the correct answer | Only input data $x$; no labels |
| Objective | Learn to predict the output for new inputs | Discover structure and patterns in the data |
| Correctness | Each prediction has a single objectively correct answer | No single "correct" answer; assessment is qualitative |
| Evaluation | Accuracy, precision, recall, F1-score, etc. | Interpretability, usefulness for downstream tasks, domain alignment |

The fundamental difference in objectives is crucial: supervised learning asks "given this input, what is the output?" Unsupervised learning asks "what structure exists in this data?"

Why This Matters for Evaluation

Because supervised learning has objectively correct answers, we can easily measure whether our model is right or wrong: we compute the percentage of predictions that match ground-truth labels and call that accuracy. Unsupervised learning has no ground truth to compare against. We cannot simply count "correct" clusters. Instead, we must evaluate results based on whether they make sense to human experts and whether they help solve a practical problem. This makes unsupervised learning more subjective, but also more closely tied to real business value: the discovered structure must actually be useful.

Practical Applications of Unsupervised Learning

Feature Engineering

One of the most powerful applications of unsupervised learning is using it to create better features for supervised learning.
For example:

1. Apply PCA to a set of 100 correlated sensor measurements to create 10 uncorrelated principal components
2. Train a supervised classifier on these 10 components instead of the original 100
3. The result often outperforms a classifier trained on all 100 original features, because noise is eliminated and the structure is simplified

This two-stage approach, unsupervised feature discovery followed by supervised prediction, combines the strengths of both paradigms.

Data Pre-Processing

Before applying supervised algorithms, unsupervised methods prepare data:

- Removing redundancy: dimensionality reduction eliminates correlated features
- Detecting outliers: clustering can identify points that don't fit well in any cluster, suggesting they may be errors or anomalies
- Handling missing data: understanding cluster structure can inform intelligent imputation strategies

These preprocessing steps often dramatically improve downstream model performance.

Data Compression

By representing data in fewer dimensions, unsupervised learning enables efficient storage and transmission. For example:

- Compress high-resolution images using dimensionality reduction techniques
- Store large datasets in fewer bytes by keeping only the most important components
- Transmit compressed representations across bandwidth-limited networks

This is especially valuable in settings like IoT devices, where storage and transmission are constrained.

Summary

Unsupervised learning is a complementary approach to supervised learning that reveals hidden structure in unlabeled data. The two primary techniques are:

- Clustering: partitions data into groups of similar items (k-Means, hierarchical clustering)
- Dimensionality reduction: finds compact representations that preserve important information (PCA, t-SNE)

Success in unsupervised learning depends not on matching ground-truth labels, but on whether the discovered structure aligns with domain knowledge and solves practical problems.
Unsupervised methods are invaluable for data exploration, feature engineering, and preprocessing in real-world machine learning pipelines.
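As a rough sketch of the compress-and-reconstruct idea from the applications section, here is PCA-style compression on synthetic "sensor" data; the dimensions, noise level, and variable names are all illustrative:

```python
import numpy as np

# Synthetic data: 100 correlated "sensor" features driven by 5 hidden
# factors, plus a little measurement noise (assumption for illustration)
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 100))
X = factors @ mixing + 0.1 * rng.normal(size=(500, 100))

# Fit PCA via SVD of the centered data
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)

k = 5                              # keep only the top 5 components
Z = (X - mean) @ Vt[:k].T          # compressed code: 500 x 5 instead of 500 x 100
X_hat = Z @ Vt[:k] + mean          # reconstruction from the compressed code

# Relative reconstruction error: small, because the dropped 95
# components carry mostly noise
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

Storing Z (plus the 5 component vectors and the mean) in place of X is a 20x reduction in this toy setup, while the reconstruction stays close to the original: exactly the trade-off described above.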
Flashcards
How is the data used in unsupervised learning characterized in terms of labels?
It is without any explicit labels or target outcomes.
What is the algorithm's primary task when dealing with raw inputs in unsupervised learning?
Discovering structure, patterns, or useful representations.
How does unsupervised learning differ from supervised learning regarding data and objectives?
- Uses only input data without output labels (vs input-output pairs)
- Uncovers latent structure (vs predicting output for new inputs)
- Has no single correct answer (vs a single correct answer per prediction)
- Evaluated by interpretability and usefulness (vs accuracy/precision metrics)
What is the basic definition of clustering algorithms?
Partitioning data into groups of similar items without predefined labels.
How does k-Means clustering partition data into $k$ spherical groups?
By iteratively updating cluster centroids and assigning points to the nearest centroid.
What structure does hierarchical clustering build to view data at different granularities?
A tree of nested clusters.
What is the primary goal of dimensionality reduction?
Seeking a compact representation of high-dimensional data while preserving important information.
How does dimensionality reduction benefit data visualization?
It enables viewing complex data in two or three dimensions to make patterns interpretable.
How does Principal Component Analysis (PCA) transform original variables?
Into a smaller set of new orthogonal features that capture maximum variance.
How does t-Distributed Stochastic Neighbor Embedding create a visual map?
By converting high-dimensional similarities into low-dimensional distances to preserve local structure.

Key Concepts
Unsupervised Learning Techniques
Unsupervised learning
Clustering
k‑Means clustering
Hierarchical clustering
Dimensionality reduction
Principal component analysis
t‑Distributed Stochastic Neighbor Embedding
Feature engineering
Data Processing
Data compression
Supervised Learning
Supervised learning