t-SNE (t-distributed Stochastic Neighbor Embedding)
To map high-dimensional data to a lower-dimensional (2D or 3D) space for visualization, while preserving significant structure in the original data.
Term | Definition |
---|---|
Similarity | * The similarity of datapoint $x_j$ to $x_i$ is the conditional probability $p_{j|i}$; * That is, $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian or Student's t-distribution centered at $x_i$. |
Perplexity | * A user-specified effective number of neighbors (a hyperparameter); * Usually between 5 and 50; * Balances local and global aspects of the data (see the sketch below) |
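In the original paper, the perplexity of the conditional distribution $P_i$ is defined as $2^{H(P_i)}$, where $H$ is Shannon entropy, and t-SNE searches for the Gaussian bandwidth $\sigma_i$ at each point that hits the user's target. A minimal Python sketch of that relationship (the function names `conditional_probs` and `perplexity` are illustrative, not from any library):

```python
import numpy as np

def conditional_probs(distances_sq, sigma):
    """Gaussian conditional probabilities p_{j|i} for one point i,
    given squared distances to all other points and bandwidth sigma."""
    logits = -distances_sq / (2.0 * sigma ** 2)
    logits -= logits.max()               # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perplexity = 2^H(P_i): the effective number of neighbors."""
    h = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy in bits
    return 2.0 ** h

# Toy example: squared distances from point i to five other points.
d_sq = np.array([0.5, 1.0, 2.0, 8.0, 9.0])
p = conditional_probs(d_sq, sigma=1.0)
print(p, perplexity(p))  # t-SNE tunes sigma until this matches the target
```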
Steps | Description |
---|---|
#1 | In the high-dimensional space, convert Euclidean distances between datapoints into conditional-probability similarities under a Gaussian distribution, setting each point's bandwidth so that the resulting distribution matches the user-specified perplexity |
#2 | In the low-dimensional space, compute the similarity between two points under a Student's t-distribution (df=1) rather than a Gaussian, to avoid the "crowding problem" |
#3 | * To reduce the mismatch between the corresponding high- and low-dimensional conditional probabilities, minimize the sum of Kullback-Leibler (KL) divergences over all datapoints using gradient descent; * The cost function favors retaining nearby map points, preserving local structure in the data (sketched below). |
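The quantities in steps #2 and #3 are compact enough to sketch directly. This toy Python snippet (an illustration, not an optimized implementation) computes the Student's t similarities $q_{ij}$ of a 2D embedding and the KL cost that gradient descent drives down:

```python
import numpy as np

def low_dim_similarities(Y):
    """Student-t (df = 1) joint similarities q_ij for a 2D embedding Y."""
    d_sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    w = 1.0 / (1.0 + d_sq)       # heavy-tailed kernel eases crowding
    np.fill_diagonal(w, 0.0)     # q_ii is defined to be zero
    return w / w.sum()

def kl_cost(P, Q):
    """KL(P || Q): the mismatch that gradient descent minimizes."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```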
Using a dataset of credit card transactions, I ran my own variation of an R implementation of t-SNE.
The dataset has 284,807 transactions (492 fraudulent and 284,315 legitimate) and 29 feature variables. Using perplexity = 29, 1,000 iterations, and a balanced subsample (the 492 fraudulent transactions and 492 sampled legitimate transactions), the t-SNE algorithm reduces the complicated high-dimensional relationships among the 29 feature variables to a 2D space:
Note. t-SNE preserves much of the distinction between the two classes using only the 29 feature variables, without access to the Class variable.
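The analysis above was done in R; a rough Python equivalent with scikit-learn would look like the sketch below. It assumes the public credit-card fraud dataset layout (columns Time, V1..V28, Amount, and Class), and the file name `creditcard.csv` is a placeholder:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder file name; the public dataset has columns Time, V1..V28,
# Amount (dropping Time leaves 29 features) and a Class label (1 = fraud).
df = pd.read_csv("creditcard.csv")

# Balance the classes: all 492 frauds plus 492 sampled legitimate rows.
fraud = df[df["Class"] == 1]
legit = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)
balanced = pd.concat([fraud, legit])

X = balanced.drop(columns=["Time", "Class"]).to_numpy()
y = balanced["Class"].to_numpy()

# perplexity = 29 as in the text; 1,000 iterations is scikit-learn's default.
emb = TSNE(n_components=2, perplexity=29, random_state=42).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="coolwarm", s=8)
plt.show()
```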
Using the digits dataset, I ran my modified version of the t-SNE Python code; the results are shown below.
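A minimal scikit-learn version of such a run (a hedged sketch, not the exact modified code used for the figures) looks like this:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# The 8x8 handwritten-digit images: 1,797 samples, 64 features.
X, y = load_digits(return_X_y=True)

# Embed the 64-dimensional data in 2D (perplexity 30 is the default).
emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=8)
plt.colorbar(label="digit")
plt.show()
```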
Algorithm | Implementation for visualization | Preserving structure in the original data |
---|---|---|
PCA (Principal component analysis) | 1. Identifying the directional axis that yields the highest variance in the high-dimensional data (the 1st PC); 2. Linearly projecting the high-dimensional dataset into a lower-dimensional space described by the 1st PC and its orthogonal direction; 3. Visualizing the transformed data in 2D. | As a linear dimension reduction technique, PCA is unable to preserve complex relationships between features in the original data. |
t-SNE | 1. Creating a 2D embedding of the high-dimensional data; 2. Selecting the points' locations in the 2D space using a probability distribution proportional to a similarity measure between two data points in the original data; 3. Minimizing the KL divergence between the probability distributions of the 2D embedding and the original data. | * Captures much of the local structure of the high-dimensional data; * Reveals global structure such as the presence of clusters at several scales (see the side-by-side sketch below). |
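To see the contrast in practice, one can project the same digits data with both methods. The following is a hedged sketch (the figure layout is illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)

# Linear projection onto the first two principal components.
pca_emb = PCA(n_components=2).fit_transform(X)
# Nonlinear embedding that preserves local neighborhoods.
tsne_emb = TSNE(n_components=2, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in zip(axes, [pca_emb, tsne_emb], ["PCA", "t-SNE"]):
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=8)
    ax.set_title(title)
plt.show()
```

In such a comparison, PCA typically shows heavily overlapping digit classes, while t-SNE separates them into distinct clusters.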
- Visualizing employees' attributes in the retail industry
- Python example
- UMAP, a technique arguably better than t-SNE