Spring 2024
Here is an excellent talk by the creator of t-SNE: video link
https://distill.pub/2016/misread-tsne/
Wattenberg et al. write: “A popular method for exploring high-dimensional data is something called t-SNE… it has an almost magical ability to create compelling two-dimensional “maps” from data with hundreds or even thousands of dimensions. Although impressive, these images can be tempting to misread.”
Wattenberg: “The algorithm is non-linear and adapts to the underlying data, performing different transformations on different regions. Those differences can be a major source of confusion.”
Wattenberg: “A second feature of t-SNE is a tuneable parameter, “perplexity,” which says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting pictures.”
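The "number of close neighbors" reading can be made concrete. In the original t-SNE formulation, perplexity is 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution. The sketch below is illustrative only: the function name and the two example distributions are assumptions, not taken from the article.

```python
import math

def perplexity(probs):
    """Perplexity 2**H of a discrete distribution, with H the Shannon entropy in bits."""
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy_bits

# Equal weight on 4 neighbors behaves like exactly 4 neighbors.
uniform = [0.25, 0.25, 0.25, 0.25]
# Nearly all weight on one neighbor behaves like slightly more than 1 neighbor.
peaked = [0.97, 0.01, 0.01, 0.01]

print(perplexity(uniform))
print(perplexity(peaked))
```

A uniform distribution over k neighbors has perplexity k, which is why the parameter reads as an effective neighbor count.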
“By size we mean bounding box measurements, not number of points.”
“The t-SNE algorithm adapts its notion of “distance” to regional density variations in the data set. As a result, it naturally expands dense clusters, and contracts sparse ones, evening out cluster sizes.”
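One way to see where this density adaptation comes from: in the original t-SNE formulation, each point gets its own Gaussian bandwidth, chosen so that its neighbor distribution hits the target perplexity. The sketch below assumes made-up neighbor distances and hypothetical helper names. It is a simplified stand-in for that search, not the code of any particular library.

```python
import math

def conditional_probs(sq_dists, sigma):
    """Gaussian neighbor weights for one point, normalized to sum to 1."""
    d_min = min(sq_dists)  # shifting by the minimum keeps at least one weight at 1.0
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2**H, with H the Shannon entropy in bits."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** h

def find_sigma(sq_dists, target_perplexity, lo=1e-6, hi=1e6, iters=100):
    """Bisect on a log scale for the sigma whose perplexity matches the target."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if perplexity(conditional_probs(sq_dists, mid)) > target_perplexity:
            hi = mid  # too many effective neighbors: shrink the bandwidth
        else:
            lo = mid  # too few effective neighbors: widen the bandwidth
    return math.sqrt(lo * hi)

# Assumed toy data: squared distances from one point to its 10 neighbors.
dense_sq = [(0.1 * k) ** 2 for k in range(1, 11)]   # point in a dense region
sparse_sq = [(1.0 * k) ** 2 for k in range(1, 11)]  # same layout, 10x more spread out

sigma_dense = find_sigma(dense_sq, target_perplexity=5.0)
sigma_sparse = find_sigma(sparse_sq, target_perplexity=5.0)
print(sigma_dense, sigma_sparse)
```

The point in the sparse region ends up with a bandwidth about 10 times larger, so both points "see" the same effective number of neighbors. This is the mechanism that evens out apparent cluster sizes.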
“Distances between clusters might not mean anything”
“The next diagrams show three Gaussians of 50 points each, one pair being 5 times as far apart as another pair.”
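A dataset of this kind is easy to rebuild for experimentation. The sketch below makes several assumptions not stated in the quote: the points are 2-D, the clusters have unit variance, and the centers sit at x = 0, 10, and 60, so that one gap is 5 times the other. The helper names are hypothetical. The embedding step uses scikit-learn's `TSNE` and is imported lazily so that the data generation runs without it.

```python
import math
import random

def make_three_clusters(n_per_cluster=50,
                        centers=((0.0, 0.0), (10.0, 0.0), (60.0, 0.0)),
                        seed=0):
    """Three unit-variance Gaussian clusters; the center positions are an assumption."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
            labels.append(label)
    return points, labels

def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def gap_ratio(points, labels):
    """Distance between clusters 1 and 2 divided by distance between clusters 0 and 1."""
    c = [centroid([p for p, l in zip(points, labels) if l == k]) for k in range(3)]
    return math.dist(c[1], c[2]) / math.dist(c[0], c[1])

def embed_with_tsne(points, perplexity):
    """Embed with scikit-learn's t-SNE (requires numpy and scikit-learn)."""
    import numpy as np
    from sklearn.manifold import TSNE
    model = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    return model.fit_transform(np.asarray(points))

points, labels = make_three_clusters()
print(gap_ratio(points, labels))  # close to 5 in the input space
```

Calling `gap_ratio(embed_with_tsne(points, perplexity), labels)` for several perplexity values gives a quick numeric check of whether the 5-to-1 spacing survives the embedding.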
“Random noise doesn’t always look random.”
“The next diagrams show genuinely random data, 500 points drawn from a unit Gaussian distribution in 100 dimensions. The left image is a projection onto the first two coordinates.”
“For topology, you may need more than one plot”
“The plots below show two groups of 75 points in 50 dimensional space. Both are sampled from symmetric Gaussian distributions centered at the origin, but one is 50 times more tightly dispersed than the other. The “small” distribution is in effect contained in the large one.”
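The geometry described here can be checked numerically before plotting anything. This sketch assumes that "50 times more tightly dispersed" means a standard deviation 50 times smaller, and uses hypothetical helper names. It samples both groups and compares how far their points lie from the origin.

```python
import math
import random

def sample_gaussian(n, dim, std, rng):
    """n points from a symmetric Gaussian centered at the origin."""
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n)]

def norm(p):
    return math.sqrt(sum(x * x for x in p))

rng = random.Random(0)
large = sample_gaussian(75, 50, 1.0, rng)         # wide group
small = sample_gaussian(75, 50, 1.0 / 50.0, rng)  # assumed: std 50x smaller

large_norms = [norm(p) for p in large]
small_norms = [norm(p) for p in small]

# Every "small" point is closer to the origin than every "large" point.
print(max(small_norms), min(large_norms))
# The typical distance from the origin differs by roughly the factor 50.
print((sum(large_norms) / 75) / (sum(small_norms) / 75))
```

In 50 dimensions the wide group sits on a shell far from the origin while the tight group stays near the center. That containment is the structure a single 2-D plot struggles to show.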
https://pair-code.github.io/understanding-umap/
n_neighbors - the number of approximate nearest neighbors used to construct the initial high-dimensional graph.
min_dist - the minimum distance between points in low-dimensional space.
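A minimal sketch of how these two parameters are passed to the `umap-learn` package follows. The toy data, the helper names, and the grid of values are assumptions for illustration. The defaults shown (`n_neighbors=15`, `min_dist=0.1`) match the library's own defaults. The UMAP import is lazy so that the rest of the block runs without the package installed.

```python
import random

def make_toy_data(n_points=200, dim=10, seed=0):
    """Assumed toy data: two Gaussian blobs in `dim` dimensions."""
    rng = random.Random(seed)
    points = []
    for i in range(n_points):
        center = 0.0 if i < n_points // 2 else 5.0
        points.append([rng.gauss(center, 1.0) for _ in range(dim)])
    return points

def embed_with_umap(points, n_neighbors=15, min_dist=0.1):
    """Embed into 2-D with umap-learn (requires numpy and umap-learn)."""
    import numpy as np
    import umap
    reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                        n_components=2, random_state=42)
    return reducer.fit_transform(np.asarray(points))

# Small n_neighbors emphasizes local structure, large values emphasize global structure.
# Small min_dist allows tight clumps, large values spread points out.
PARAMETER_GRID = [(n, d) for n in (5, 15, 50) for d in (0.0, 0.1, 0.5)]

points = make_toy_data()
print(len(points), len(PARAMETER_GRID))
```

Looping over `PARAMETER_GRID` and calling `embed_with_umap(points, n, d)` produces one embedding per setting, which is a simple way to see how sensitive a picture is to these choices.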
“However, it’s important to note that, because UMAP and t-SNE both necessarily warp the high-dimensional shape of the data when projecting to lower dimensions, any given axis or distance in lower dimensions still isn’t directly interpretable in the way of techniques such as PCA.”
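To illustrate the contrast with PCA: a principal axis is an explicit weighted combination of the original features, so its direction can be read off directly. The sketch below handles only the 2-D case, using the closed-form angle of the major axis of a 2x2 covariance matrix. The data and the function name are assumptions for illustration.

```python
import math

def first_principal_axis(points):
    """Unit vector along the direction of largest variance for 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form orientation of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

# Assumed toy data lying exactly on the line y = 2x.
data = [(-2.0, -4.0), (-1.0, -2.0), (0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
axis = first_principal_axis(data)
print(axis)  # weights on (x, y): the axis means "1 part x to 2 parts y"
```

No such reading exists for a t-SNE or UMAP axis, because the mapping from input features to embedding coordinates is nonlinear and varies across regions of the data.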
Suggested reading: https://pair-code.github.io/understanding-umap/supplement.html