Dataset of 28x28 pixel images of handwritten digits
MNIST Dataset
Every MNIST image can be thought of as a 28x28 array of numbers describing how dark each pixel is:
We can flatten each array into a 28 * 28 = 784 dimensional vector, where each component of the vector is a value between zero and one describing the intensity of the pixel
We can think of MNIST as a collection of 784-dimensional vectors
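As a minimal sketch of this (assuming scikit-learn's fetch_openml copy of MNIST; any loader that yields the 70,000 × 784 array works the same way):
from sklearn.datasets import fetch_openml

# Fetch MNIST from OpenML: 70,000 images, already flattened to 784 columns
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Scale pixel intensities from [0, 255] down to [0, 1]
X = X / 255.0

print(X.shape)  # (70000, 784) -- each row is one 28x28 image as a 784-dimensional vector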
MNIST Dataset
But not all vectors in this 784-dimensional space are MNIST digits! A typical point in this space looks nothing like a digit
To get a sense of what a typical point looks like, we can randomly pick a few random 28x28 images – each pixel is randomly black, white or some shade of gray. These random points look like noise:
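A quick sketch of drawing one such random point (plain NumPy; the reshape back to 28x28 is only so the vector can be viewed as an image):
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# A "typical" point in the 784-dimensional cube: every pixel intensity uniform in [0, 1]
random_point = rng.uniform(0.0, 1.0, size=784)

# Reshape to 28x28 only for display -- it looks like static, not a digit
plt.imshow(random_point.reshape(28, 28), cmap="gray")
plt.axis("off")
plt.show()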
MNIST Dataset
28x28 images that look like MNIST digits are very rare - they make up a very small subspace of 784-dimensional space
With some slightly harder arguments, we can see that they occupy a lower (than 784) dimensional subspace
Many theories about lower-dimensional structure of MNIST (and similar data)
Manifold hypothesis (popular among ML researchers): MNIST is a low dimensional manifold curving through its high-dimensional embedding space
Another hypothesis (rooted in topological data analysis) is that data like MNIST consists of blobs with tentacle-like protrusions sticking out into the surrounding space
But no one actually knows for sure!
MNIST Cube
Imagine the MNIST data points as points suspended in a 784-dimensional cube
Each dimension of the cube corresponds to a particular pixel
Along each dimension, a data point's coordinate ranges from zero to one according to that pixel's intensity
On one side of the dimension, there are images where that pixel is white. On the other side of the dimension, there are images where it is black. In between, there are images where it is gray.
What does this cube look like if we look at a particular two-dimensional face? Let's look at @Olah_2014's visualizations.
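A minimal sketch of looking at one two-dimensional face of the cube (this mirrors the idea behind Olah's figures, not his exact code; the two pixel indices below are arbitrary choices for illustration, and X, y are the arrays loaded earlier):
import matplotlib.pyplot as plt

# Two arbitrary pixel positions, i.e. two dimensions of the cube (illustrative indices)
px_a, px_b = 14 * 28 + 10, 14 * 28 + 18

# Plot every image by just those two coordinates, coloured by digit class
plt.scatter(X[:, px_a], X[:, px_b], c=y.astype(int), cmap="tab10", s=2, alpha=0.3)
plt.xlabel(f"intensity of pixel {px_a}")
plt.ylabel(f"intensity of pixel {px_b}")
plt.show()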
MNIST Cube
What qualities would the ‘perfect’ visualization of MNIST have? What should our goal be?
What is the best way to cluster MNIST data?
We can try what we learned last week…
Principal Component Analysis Recap
Dimensionality reduction technique that allows us to find clusters of similar data points based on many features (which we boil down to two “principal components” PC1 and PC2)
PC1 and PC2 represent the directions in the data space with the highest and second-highest variances, respectively
Way to bring out strong patterns from large and complex datasets
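A minimal refresher of what that looks like in code (scikit-learn's PCA; plot_clustering is the same plotting helper used for the other projections below):
from sklearn.decomposition import PCA

pca = PCA(n_components=2)        # keep only PC1 and PC2
X_pca = pca.fit_transform(X)     # project the 784-dim vectors onto those two directions

# Fraction of total variance captured by PC1 and PC2
print(pca.explained_variance_ratio_)

plot_clustering(X_pca, "", "PCA")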
Common theme is to use a cost function that emphasizes local structure as more important to maintain than global structure
One version of this is Sammon's Mapping, which uses the cost function shown below:
In Sammon’s mapping, we try harder to preserve the distances between nearby points than between those which are far apart. If two points are twice as close in the original space as two others, it is twice as important to maintain the distance between them.
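For reference, this is the usual statement of Sammon's stress (the standard textbook form, not taken from the slide), with $d^{*}_{ij}$ the distance between points $i$ and $j$ in the original space and $d_{ij}$ the corresponding distance in the low-dimensional map:

$$E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{\left(d^{*}_{ij} - d_{ij}\right)^{2}}{d^{*}_{ij}}$$

The division by $d^{*}_{ij}$ is what makes errors between nearby points cost more, giving the "twice as close, twice as important" behaviour described above.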
Sparse Random Projection
Reduces the dimensionality by projecting the original input space using a sparse random matrix
Alternative to dense Gaussian random projection matrix - guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data
If we define s = 1 / density, the elements of the random matrix are drawn from
-sqrt(s / n_components) with probability 1 / (2s)
0 with probability 1 - 1/s
+sqrt(s / n_components) with probability 1 / (2s)
where n_components is the size of the projected subspace. By default, the density of nonzero elements is set to the minimum density recommended by Ping Li et al.: 1 / sqrt(n_features)
Sparse Random Projection on MNIST
from sklearn.random_projection import SparseRandomProjection

# Project the 784-dimensional digits down to 2 dimensions with a sparse random matrix
srp = SparseRandomProjection(n_components=2, random_state=42)
X_srp = srp.fit_transform(X)
plot_clustering(X_srp, "", "Sparse Random Projection")
Sparse Random Projection on MNIST
# Same projection, but into 3 dimensions
srp_3d = SparseRandomProjection(n_components=3, random_state=42)
X_srp_3d = srp_3d.fit_transform(X)
plot_3d_clustering(X_srp_3d, "", "Sparse Random Projection")
Locally Linear Embedding
First, we find the K nearest neighbors of each data point
One advantage of the LLE algorithm is that there is only one parameter to tune (K). If K is chosen too small or too large, it will not be able to accommodate the geometry of the original data.
Next, we approximate each data vector as a weighted linear combination of its K nearest neighbors. We find the weights that minimize the reconstruction cost E(W) = Σ_i | X_i − Σ_j W_ij X_j |², where X_j is the j'th nearest neighbour of point X_i.
Locally Linear Embedding
Finally, LLE takes the weights that best reconstruct each vector from its neighbors and finds the low-dimensional vectors that are best reconstructed by those same weights.
In other words, we choose the new (low-dimensional) points Y to minimize the embedding cost Φ(Y) = Σ_i | Y_i − Σ_j W_ij Y_j |², with the weights W_ij held fixed:
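A sketch using scikit-learn's implementation (n_neighbors is the single parameter K discussed above; plot_clustering is the same plotting helper as before):
from sklearn.manifold import LocallyLinearEmbedding

# On the full 70,000-point MNIST this is slow; a random subsample is commonly used
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=42)
X_lle = lle.fit_transform(X)
plot_clustering(X_lle, "", "Locally Linear Embedding")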
t-SNE (t-distributed Stochastic Neighbor Embedding)
Tries to optimize for preserving the topology of the data
For every point, t-SNE constructs a notion of which other points are its 'neighbors', trying to make sure all points have roughly the same number of neighbors. It then embeds the points in a low-dimensional space so that, as far as possible, each point keeps those same neighbors.
t-SNE plots are popular and can be useful, but only if you avoid common misreadings
Beware of hyperparameter values
Remember cluster sizes in a t-SNE plot mean nothing
Know that distances BETWEEN clusters may not mean anything either
Random noise doesn’t always look random
You may need multiple plots for topology
See Wattenberg, Viégas, and Johnson (2016) for more details
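A minimal sketch with scikit-learn's TSNE (note the perplexity hyperparameter, which is exactly the kind of setting the warnings above are about):
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)
plot_clustering(X_tsne, "", "t-SNE")

# Re-run with several perplexity values and random seeds before trusting the shapes you see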
Uniform Manifold Approximation and Projection (UMAP)
Similar to t-SNE, but with some advantages:
Increased speed
Better preservation of the data’s global structure
Uniform Manifold Approximation and Projection (UMAP)
UMAP constructs a high dimensional graph representation of the data then optimizes a low-dimensional graph to be as structurally similar as possible
Starts with a "fuzzy simplicial complex," a representation of a weighted graph whose edge weights represent the likelihood that two points are connected.
To determine connectedness, UMAP extends a radius outwards from each point, connecting points when those radii overlap.
Choice of radius matters! Too small = small, isolated clusters. Too large = everything is connected. UMAP overcomes this challenge by choosing a radius locally, based on the distance to each point's nth nearest neighbor.
The graph is made "fuzzy" by decreasing the likelihood of connection as the radius grows
By stipulating that each point must be connected to at least its closest neighbor, UMAP ensures that local structure is preserved in balance with global structure
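A sketch using the third-party umap-learn package (not scikit-learn; n_neighbors controls the local-radius trade-off described above, and min_dist controls how tightly points are packed in the embedding):
import umap

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
X_umap = reducer.fit_transform(X)
plot_clustering(X_umap, "", "UMAP")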
Olah, Christopher. 2014. "Visualizing MNIST: An Exploration of Dimensionality Reduction." Colah's Blog. https://colah.github.io/posts/2014-10-Visualizing-MNIST/.