PCA and UMAP can take high-dimensional embedding vectors and squish them down into a lower, more manageable number of dimensions, making them easier to visualize and use in downstream tasks.
Let’s see this in action. Imagine we have some text embeddings: vectors representing the meaning of words or sentences. A common way to get these is with models like Sentence-BERT. Say we have 1000 sentences, each represented by a vector with hundreds of dimensions (768 for many BERT-based models, 384 for the MiniLM model used below). That’s a lot of dimensions to plot!
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import umap
import numpy as np
import matplotlib.pyplot as plt
# 1. Generate some dummy sentence embeddings
# In a real scenario, you'd load your actual embeddings
sentences = [
"This is the first sentence.",
"This is the second sentence, quite similar.",
"A completely different topic altogether.",
"Another sentence about something else.",
"Yet another sentence, very much like the first.",
"The weather today is quite pleasant.",
"I enjoy walking in the park.",
"Machine learning is fascinating.",
"Deep learning is a subset of machine learning.",
"Natural language processing is key to understanding text."
] * 100 # Repeat to get 1000 sentences
# Load a pre-trained model (e.g., 'all-MiniLM-L6-v2' is fast and good)
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(f"Original embedding shape: {embeddings.shape}") # e.g., (1000, 384) for this model
# 2. Apply PCA
# We'll reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
embeddings_pca = pca.fit_transform(embeddings)
print(f"PCA embedding shape: {embeddings_pca.shape}")
# 3. Apply UMAP
# UMAP is often better for preserving local structure
# n_neighbors and min_dist are key parameters
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
embeddings_umap = reducer.fit_transform(embeddings)
print(f"UMAP embedding shape: {embeddings_umap.shape}")
# 4. Visualize the results
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(embeddings_pca[:, 0], embeddings_pca[:, 1], s=5)
plt.title('PCA Reduction (2D)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.subplot(1, 2, 2)
labels = np.tile(np.arange(10), 100)  # group id: 10 distinct sentences repeated 100 times
plt.scatter(embeddings_umap[:, 0], embeddings_umap[:, 1], s=5, c=labels, cmap='viridis')
plt.title('UMAP Reduction (2D)')
plt.xlabel('UMAP Component 1')
plt.ylabel('UMAP Component 2')
plt.tight_layout()
plt.show()
The core problem these techniques solve is the "curse of dimensionality." As the number of dimensions in your data increases, the volume of the space increases exponentially. This makes it much harder for algorithms to find meaningful patterns, as data points become increasingly sparse and equidistant from each other. High-dimensional data is also computationally expensive to process and impossible to visualize directly. PCA and UMAP offer a way to compress this information into a lower-dimensional space while trying to retain as much of the original data’s structure as possible.
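You can see distance concentration numerically. The sketch below (using uniform random points, not real embeddings) measures the relative gap between the nearest and farthest neighbor of a point as dimensionality grows; the gap shrinks, which is exactly what makes nearest-neighbor reasoning hard in high dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# As dimensionality grows, pairwise distances between random points
# concentrate: nearest and farthest neighbors become nearly equidistant.
for d in (2, 100, 10_000):
    points = rng.random((500, d))
    # Distances from the first point to all the others
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={d:>6}: relative distance contrast = {contrast:.3f}")
```

The contrast value drops sharply as the dimension increases, even though the data-generating process is identical in every case.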
PCA (Principal Component Analysis) is a linear method. It finds a new set of orthogonal axes (principal components) that capture the maximum variance in the data. The first principal component explains the most variance, the second explains the most remaining variance orthogonal to the first, and so on. When you reduce dimensions with PCA, you’re essentially projecting your data onto a subspace defined by the top k principal components. It’s like finding the best "flat" representation of your high-dimensional data. The trade-off is that it can miss non-linear relationships.
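Before fixing k = 2, it is worth checking how much variance the top components actually capture. Scikit-learn exposes this as `explained_variance_ratio_`. A sketch using synthetic data as a stand-in for real embeddings (the data is built with low-rank structure so a few components dominate):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic stand-in for embeddings: 1000 points in 50 dimensions,
# constructed so almost all variance lies in a 5-dimensional subspace.
base = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 50))
data = base + 0.05 * rng.normal(size=(1000, 50))

pca = PCA(n_components=10).fit(data)
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, frac in enumerate(cumulative, start=1):
    print(f"top {k:>2} components explain {frac:.1%} of the variance")
```

On real embeddings the curve is usually much flatter, which is one reason a 2D PCA plot of embeddings can look like an undifferentiated blob.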
UMAP (Uniform Manifold Approximation and Projection) is a non-linear method. It’s based on manifold learning and topological data analysis. UMAP constructs a high-dimensional graph representing the relationships between data points and then optimizes a low-dimensional graph to be as structurally similar as possible to the high-dimensional one. It tries to preserve both local and global structure, often resulting in clusters that are more visually distinct and meaningful than those produced by PCA, especially for complex datasets like embeddings. The parameters n_neighbors and min_dist are crucial: n_neighbors controls how UMAP balances local versus global structure (smaller values focus more on local structure), and min_dist controls how tightly UMAP is allowed to pack points together in the low-dimensional space (smaller values allow for denser clusters).
The key difference in how they work internally is PCA’s reliance on covariance matrices and eigenvectors to find directions of maximum variance, treating the problem as finding a linear projection. UMAP, on the other hand, builds a fuzzy topological representation of the data in high dimensions and then seeks to find a low-dimensional embedding that best preserves this topology. This is why UMAP often excels at revealing cluster structures that PCA might smooth over.
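PCA's covariance-and-eigenvector recipe is simple enough to verify by hand. The sketch below implements it with NumPy and checks that the projection matches scikit-learn's (up to the arbitrary sign of each eigenvector):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 6))

# PCA "by hand": eigendecomposition of the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # sort descending by variance
top2 = eigvecs[:, order[:2]]             # directions of maximum variance
manual_proj = centered @ top2

# sklearn should give the same projection, up to a per-component sign flip.
sk_proj = PCA(n_components=2).fit_transform(data)
print(np.allclose(np.abs(manual_proj), np.abs(sk_proj), atol=1e-6))
```

There is no analogous closed-form check for UMAP: its output is the result of an iterative graph-layout optimization, not a matrix factorization.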
A subtle but important point about UMAP is its probabilistic nature and sensitivity to the random seed. While random_state makes it reproducible, different random seeds can lead to slightly different (but often qualitatively similar) embeddings. This is because UMAP uses stochastic gradient descent to optimize the low-dimensional representation, and the initial state and path of this optimization can vary. It’s less about finding the single best low-dimensional representation and more about finding a good low-dimensional representation that captures the manifold’s structure.
The next step after reducing dimensions is often clustering, or using these lower-dimensional embeddings as features for a classifier.
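As a minimal sketch of that pipeline, assuming synthetic blobs stand in for real embeddings: reduce with PCA, cluster with k-means, and score the recovered clusters against the known generating labels.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Stand-in for embeddings: 3 clusters in 100 dimensions.
data, true_labels = make_blobs(n_samples=600, n_features=100,
                               centers=3, random_state=0)

# Reduce to 10 dimensions first, then cluster in the reduced space.
reduced = PCA(n_components=10).fit_transform(data)
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

# Compare recovered clusters to the known generating labels.
print(f"adjusted Rand index: {adjusted_rand_score(true_labels, pred):.2f}")
```

For clustering, moderate target dimensions (10 to 50) often work better than the 2D used for visualization, since 2D projections discard structure the clusterer could use.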