Dataset of 28x28 pixel images of handwritten digits

MNIST Dataset

Every MNIST image can be thought of as a 28x28 array of numbers describing how dark each pixel is:

We can flatten each array into a 28 * 28 = 784 dimensional vector, where each component of the vector is a value between zero and one describing the intensity of the pixel

We can think of MNIST as a collection of 784-dimensional vectors
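The flattening step can be sketched in NumPy; here a random array stands in for a real MNIST image:

```python
import numpy as np

# Stand-in for one MNIST image: a 28x28 array of intensities in [0, 1]
image = np.random.rand(28, 28)

# Flatten into a 28 * 28 = 784-dimensional vector
vector = image.reshape(-1)
print(vector.shape)  # (784,)
```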

MNIST Dataset

But not all vectors in this 784-dimensional space are MNIST digits! A typical point in this space looks very different from any digit

To get a sense of what a typical point looks like, we can randomly pick a few 28x28 images – each pixel is randomly black, white or some shade of gray. These random points look like noise:

MNIST Dataset

28x28 images that look like MNIST digits are very rare - they make up a very small subspace of 784-dimensional space

With some slightly harder arguments, we can see that they occupy a subspace of dimension lower than 784

Many theories about lower-dimensional structure of MNIST (and similar data)

Manifold hypothesis (popular among ML researchers): MNIST is a low dimensional manifold curving through its high-dimensional embedding space

Another hypothesis (rooted in topological data analysis) is that data like MNIST consists of blobs with tentacle-like protrusions sticking out into the surrounding space

But no one actually knows for sure!

MNIST Cube

Imagine the MNIST data points as points suspended in a 784-dimensional cube

Each dimension of the cube corresponds to a particular pixel

Each data point's coordinate along a given dimension ranges from zero to one according to that pixel's intensity

On one side of a given dimension are images where that pixel is white; on the other side are images where it is black; in between are images where it is gray.

What does this cube look like if we look at a particular two-dimensional face? Let's look at @Olah_2014's visualizations.

MNIST Cube

What qualities would the ‘perfect’ visualization of MNIST have? What should our goal be?

What is the best way to cluster MNIST data?

We can try what we learned last week…

Principal Component Analysis Recap

Dimensionality reduction technique that allows us to find clusters of similar data points based on many features (which we boil down to two “principal components” PC1 and PC2)

PC1 and PC2 represent the directions in the data space with the highest and second-highest variances, respectively

Way to bring out strong patterns from large and complex datasets
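A minimal PCA sketch with scikit-learn; random data stands in for the matrix of flattened MNIST images, and the two requested components are PC1 and PC2:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.random((100, 784))  # stand-in for 100 flattened MNIST images

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)  # each row is (PC1, PC2) for one image

print(X_pca.shape)  # (100, 2)
# PC1 captures at least as much variance as PC2
print(pca.explained_variance_ratio_)
```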

A common theme is to use a cost function that treats local structure as more important to maintain than global structure

One version of this is Sammon's Mapping, which uses the cost function:

$$
E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{\left(d^{*}_{ij} - d_{ij}\right)^2}{d^{*}_{ij}}
$$

where $d^{*}_{ij}$ is the distance between points $i$ and $j$ in the original space and $d_{ij}$ is their distance in the embedding

In Sammon’s mapping, we try harder to preserve the distances between nearby points than between those which are far apart. If two points are twice as close in the original space as two others, it is twice as important to maintain the distance between them.
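Sammon's cost can be computed directly; here is a small NumPy/SciPy sketch, where the embedding Y is just random for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_cost(X, Y):
    """Sammon's stress: error in pairwise distances, weighted so
    that small original distances matter more."""
    d_orig = pdist(X)  # pairwise distances in the original space
    d_emb = pdist(Y)   # pairwise distances in the embedding
    return np.sum((d_orig - d_emb) ** 2 / d_orig) / np.sum(d_orig)

rng = np.random.default_rng(0)
X = rng.random((50, 784))  # stand-in data
Y = rng.random((50, 2))    # a candidate 2-D embedding
print(sammon_cost(X, Y))   # lower is better; 0 iff all distances match
```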

Sparse Random Projection

Reduces the dimensionality by projecting the original input space using a sparse random matrix

Alternative to dense Gaussian random projection matrix - guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data

If we define s = 1 / density, the elements of the random matrix are drawn from

$$
R_{ij} = \sqrt{\frac{s}{n_{components}}} \times
\begin{cases}
-1 & \text{with probability } 1/(2s) \\
\;\;\,0 & \text{with probability } 1 - 1/s \\
+1 & \text{with probability } 1/(2s)
\end{cases}
$$

where $n_{components}$ is the size of the projected subspace

By default the density of nonzero elements is set to the minimum density recommended by Ping Li et al.: $1 / \sqrt{n_{features}}$
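Drawing such a matrix by hand can be sketched with NumPy (scikit-learn's SparseRandomProjection handles this internally):

```python
import numpy as np

def sparse_random_matrix(n_components, n_features, density, seed=0):
    # With s = 1/density, entries are +/- sqrt(s / n_components)
    # with probability 1/(2s) each, and 0 with probability 1 - 1/s.
    s = 1.0 / density
    rng = np.random.default_rng(seed)
    return rng.choice(
        [-np.sqrt(s / n_components), 0.0, np.sqrt(s / n_components)],
        size=(n_components, n_features),
        p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)],
    )

# Minimum recommended density for 784 features: 1 / sqrt(784) = 1/28
R = sparse_random_matrix(2, 784, density=1 / np.sqrt(784))
print(R.shape)  # (2, 784); the vast majority of entries are zero
```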

Sparse Random Projection on MNIST

```python
srp = SparseRandomProjection(n_components=2, random_state=42)
X_srp = srp.fit_transform(X)
plot_clustering(X_srp, "", "Sparse Random Projection")
```

Sparse Random Projection on MNIST

```python
srp_3d = SparseRandomProjection(n_components=3, random_state=42)
X_srp_3d = srp_3d.fit_transform(X)  # fit_transform both fits and projects
plot_3d_clustering(X_srp_3d, "", "Sparse Random Projection")
```

Locally Linear Embedding

First we find the k-nearest neighbors of each data point

One advantage of the LLE algorithm is that there is only one parameter to tune (K). If K is chosen too small or too large, it will not be able to accommodate the geometry of the original data. Here, for each data point, we compute the K nearest neighbours.

Next, we approximate each data vector as a weighted linear combination of its K nearest neighbors. We choose the weights $W_{ij}$ to minimize the reconstruction cost

$$
E(W) = \sum_{i} \left| X_i - \sum_{j} W_{ij} X_j \right|^2
$$

where $j$ runs over the K nearest neighbours of point $X_{i}$ and the weights for each point sum to one.

Locally Linear Embedding

Finally, having computed the weights that best reconstruct each vector from its neighbors, we produce the low-dimensional vectors that are best reconstructed by those same weights.

In other words, we define the new points $Y_i$ such that, with the weights $W_{ij}$ now held fixed, they minimize the cost:

$$
\Phi(Y) = \sum_{i} \left| Y_i - \sum_{j} W_{ij} Y_j \right|^2
$$
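In practice all three steps are handled by scikit-learn's LocallyLinearEmbedding; a sketch on stand-in data:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(42)
X = rng.random((100, 784))  # stand-in for flattened MNIST images

# n_neighbors is the single parameter K discussed above
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
X_lle = lle.fit_transform(X)
print(X_lle.shape)  # (100, 2)
```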

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Tries to optimize for preserving the topology of the data

For every point, t-SNE constructs a notion of which other points are its ‘neighbors,’ trying to make all points have the same number of neighbors. Then it tries to embed the points so that each point’s neighbors stay close to it.

t-SNE plots are popular and can be useful, but only if you avoid common misreadings

Beware of hyperparameter values

Remember cluster sizes in a t-SNE plot mean nothing

Know that distances BETWEEN clusters may not mean anything either

Random noise doesn’t always look random

You may need multiple plots for topology

See Wattenberg, Viégas, and Johnson (2016) for more details
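A sketch with scikit-learn's TSNE on stand-in data; note the perplexity hyperparameter, which the caveats above suggest varying across multiple plots:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.random((100, 784))  # stand-in for flattened MNIST images

# Try several perplexity values -- a single plot can mislead
tsne = TSNE(n_components=2, perplexity=10, random_state=42)
X_tsne = tsne.fit_transform(X)
print(X_tsne.shape)  # (100, 2)
```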

Uniform Manifold Approximation and Projection (UMAP)

Similar to t-SNE, but with some advantages:

Increased speed

Better preservation of the data’s global structure

Uniform Manifold Approximation and Projection (UMAP)

UMAP constructs a high dimensional graph representation of the data then optimizes a low-dimensional graph to be as structurally similar as possible

Starts with a "fuzzy simplicial complex," a representation of a weighted graph whose edge weights represent the likelihood that two points are connected.

To determine connectedness, UMAP extends a radius outwards from each point, connecting points when those radii overlap.

Choice of radius matters! Too small = small, isolated clusters. Too large = everything is connected. UMAP overcomes this challenge by choosing a radius locally, based on the distance to each point's n^{th} nearest neighbor.

The graph is made "fuzzy" by decreasing the likelihood of connection as the radius grows

By stipulating that each point must be connected to at least its closest neighbor, UMAP ensures that local structure is preserved in balance with global structure
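The local-radius idea can be sketched in plain NumPy: each point's radius is the distance to its n-th nearest neighbor (n here is a hypothetical stand-in for UMAP's n_neighbors parameter):

```python
import numpy as np

def local_radii(X, n):
    """Distance from each point to its n-th nearest neighbor.
    Dense points get small radii, sparse points get large ones --
    this is how UMAP adapts its notion of 'connected' locally."""
    # Pairwise Euclidean distances between all points
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Sort each row ascending; column 0 is the zero self-distance
    dists.sort(axis=1)
    return dists[:, n]

rng = np.random.default_rng(0)
X = rng.random((50, 2))
print(local_radii(X, n=5).shape)  # (50,): one radius per point
```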

Olah, Christopher. 2014. “Visualizing MNIST: An Exploration of Dimensionality Reduction.” Colah’s Blog. https://colah.github.io/posts/2014-10-Visualizing-MNIST/.