t-SNE (t-distributed Stochastic Neighbor Embedding)
To map high-dimensional data to a lower-dimensional (2D or 3D) space for visualization, while preserving significant structure in the original data.
Term | Definition |
---|---|
Similarity | * The similarity of datapoint $x_j$ to $x_i$ is the conditional probability $p_{j|i}$; * That is, $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian or Student's t-distribution centered at $x_i$. |
Perplexity | * A user-specified effective number of neighbors (a hyperparameter); * Usually between 5 and 50; * Balances local and global aspects of the data (see the sketch below) |
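In the original paper, the perplexity of the conditional distribution $P_i$ is defined as $2^{H(P_i)}$, where $H$ is Shannon entropy, and t-SNE searches for the Gaussian bandwidth $\sigma_i$ at each point that hits the user's target. A minimal Python sketch of that relationship (the function names `conditional_probs` and `perplexity` are illustrative, not from any library):

```python
import numpy as np

def conditional_probs(distances_sq, sigma):
    """Gaussian conditional probabilities p_{j|i} for one point i,
    given squared distances to all other points and bandwidth sigma."""
    logits = -distances_sq / (2.0 * sigma ** 2)
    logits -= logits.max()               # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perplexity = 2^H(P_i): the effective number of neighbors."""
    h = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy in bits
    return 2.0 ** h

# Toy example: squared distances from point i to five other points.
d_sq = np.array([0.5, 1.0, 2.0, 8.0, 9.0])
p = conditional_probs(d_sq, sigma=1.0)
print(p, perplexity(p))  # t-SNE tunes sigma until this matches the target
```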
Steps | Description |
---|---|
#1 | In the high-dimensional space, convert Euclidean distances between datapoints into conditional-probability similarities under a Gaussian distribution, setting each point's bandwidth so that the resulting distribution matches the user-specified perplexity |
#2 | In the low-dimensional space, compute the similarity between two points under a Student's t-distribution (df=1) rather than a Gaussian, to avoid the "crowding problem" |
#3 | * To reduce the mismatch between the corresponding high- and low-dimensional conditional probabilities, minimize the sum of Kullback-Leibler (KL) divergences over all datapoints using gradient descent; * The cost function favors retaining nearby map points, preserving local structure in the data (sketched below). |
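The quantities in steps #2 and #3 are compact enough to sketch directly. This toy Python snippet (an illustration, not an optimized implementation) computes the Student's t similarities $q_{ij}$ of a 2D embedding and the KL cost that gradient descent drives down:

```python
import numpy as np

def low_dim_similarities(Y):
    """Student-t (df = 1) joint similarities q_ij for a 2D embedding Y."""
    d_sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    w = 1.0 / (1.0 + d_sq)       # heavy-tailed kernel eases crowding
    np.fill_diagonal(w, 0.0)     # q_ii is defined to be zero
    return w / w.sum()

def kl_cost(P, Q):
    """KL(P || Q): the mismatch that gradient descent minimizes."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```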
Using a dataset of credit card transactions, I ran my own variation of an R implementation of t-SNE.
The dataset has 284,807 transactions (492 fraudulent and 284,315 legitimate) and 29 feature variables. Using perplexity = 29, 1,000 iterations, and a balanced subsample (the 492 fraudulent transactions and 492 sampled legitimate transactions), the t-SNE algorithm reduces the complicated high-dimensional relationships among the 29 feature variables to a 2D space:
Note. t-SNE preserves much of the distinction between the two classes using only the 29 feature variables, without access to the Class variable.
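The analysis above was done in R; a rough Python equivalent with scikit-learn would look like the sketch below. It assumes the public credit-card fraud dataset layout (columns Time, V1..V28, Amount, and Class), and the file name `creditcard.csv` is a placeholder:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder file name; the public dataset has columns Time, V1..V28,
# Amount (dropping Time leaves 29 features) and a Class label (1 = fraud).
df = pd.read_csv("creditcard.csv")

# Balance the classes: all 492 frauds plus 492 sampled legitimate rows.
fraud = df[df["Class"] == 1]
legit = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)
balanced = pd.concat([fraud, legit])

X = balanced.drop(columns=["Time", "Class"]).to_numpy()
y = balanced["Class"].to_numpy()

# perplexity = 29 as in the text; 1,000 iterations is scikit-learn's default.
emb = TSNE(n_components=2, perplexity=29, random_state=42).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="coolwarm", s=8)
plt.show()
```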
Using the digits dataset, I ran my modified version of the t-SNE Python code; the results are shown below.
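A minimal scikit-learn version of such a run (a hedged sketch, not the exact modified code used for the figures) looks like this:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# The 8x8 handwritten-digit images: 1,797 samples, 64 features.
X, y = load_digits(return_X_y=True)

# Embed the 64-dimensional data in 2D (perplexity 30 is the default).
emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=8)
plt.colorbar(label="digit")
plt.show()
```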
Algorithm | Implementation for visualization | Preserving structure in the original data |
---|---|---|
PCA (Principal component analysis) | 1. Identifying the directional axis that yields the highest variance in the high-dimensional data (the 1st PC); 2. Linearly projecting the high-dimensional dataset into a lower-dimensional space described by the 1st PC and its orthogonal direction; 3. Visualizing the transformed data in 2D. | As a linear dimension reduction technique, PCA is unable to preserve complex relationships between features in the original data. |
t-SNE | 1. Creating a 2D embedding of the high-dimensional data; 2. Selecting the points' locations in the 2D space using a probability distribution proportional to a similarity measure between two data points in the original data; 3. Minimizing the KL divergence between the probability distributions of the 2D embedding and the original data. | * Captures much of the local structure of the high-dimensional data; * Reveals global structure such as the presence of clusters at several scales (see the side-by-side sketch below). |
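To see the contrast in practice, one can project the same digits data with both methods. The following is a hedged sketch (the figure layout is illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)

# Linear projection onto the first two principal components.
pca_emb = PCA(n_components=2).fit_transform(X)
# Nonlinear embedding that preserves local neighborhoods.
tsne_emb = TSNE(n_components=2, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in zip(axes, [pca_emb, tsne_emb], ["PCA", "t-SNE"]):
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=8)
    ax.set_title(title)
plt.show()
```

In such a comparison, PCA typically shows heavily overlapping digit classes, while t-SNE separates them into distinct clusters.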
- Visualizing employees' attributes in the retail industry
- Python example
- UMAP, a technique arguably better than t-SNE