# Computation of the gradient for SNE

Many methods exist to visualize high-dimensional data through a two-dimensional map. They include linear techniques such as PCA and MDS, as well as nonlinear ones such as Isomap, LLE, SNE and t-SNE (resp. Principal Component Analysis, 1933; MultiDimensional Scaling, 1952; Isomap, 2000; Locally Linear Embedding, 2000; Stochastic Neighbor Embedding, 2002; t-Distributed Stochastic Neighbor Embedding, 2008). Some of these dimensionality reduction methods are illustrated in this sklearn example.

A popular method for nonlinear mapping is t-SNE, developed by Laurens van der Maaten and Geoffrey Hinton in [1]. It is based on SNE and improves on it by addressing the “crowding problem” (the tendency of mapped points to aggregate into a single central cluster). You can familiarize yourself with t-SNE here, which allows interactive exploration of various examples.

In this post, we derive the gradient of the SNE algorithm (not to be confused with t-SNE, whose gradient calculation is detailed in Appendix A of [1]). The SNE gradient is stated in both the original article and the t-SNE article, but it is detailed in neither (see Equation 5 of [2] and Section 2 of [1]).

In the following, we describe how SNE works (essentially a rewriting of Section 2 of [1]) before deriving the SNE gradient step by step.

## How SNE works

### High-dimensional points

We have points $(x_i)_i$ in a high-dimensional space. We can measure the distance between two points $i$ and $j$ with $\| x_i - x_j \|$.

We define the similarity between points $i$ and $j$ as a conditional probability:

$$p_{j \mid i} = \frac{\exp\left( - \| x_i - x_j \|^2 / 2 \sigma_i^2 \right)}{\sum_{k \neq i} \exp\left( - \| x_i - x_k \|^2 / 2 \sigma_i^2 \right)}.$$

$p_{j \mid i}$ represents the probability “that $x_i$ would pick $x_j$ as its neighbor, if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$”. We let $p_{i \mid i}:=0$.

The method for determining $\sigma_i$ is presented in [1] and is related to the perplexity parameter.
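To make this concrete, here is a minimal NumPy sketch of the $p_{j \mid i}$ matrix. The function name is ours, and we take the $\sigma_i$ as given inputs rather than calibrating them by perplexity as in [1]:

```python
import numpy as np

def conditional_p(X, sigmas):
    """P[i, j] = p_{j|i}: Gaussian neighbor probabilities with bandwidth sigmas[i]."""
    # Squared Euclidean distances ||x_i - x_j||^2.
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    logits = -d2 / (2.0 * np.square(sigmas)[:, None])
    np.fill_diagonal(logits, -np.inf)  # enforces p_{i|i} = 0
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stabilized row softmax
    return e / e.sum(axis=1, keepdims=True)
```

Each row of the returned matrix is the distribution $P_i$: it sums to 1 and has a zero diagonal entry.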

### Low-dimensional mapping

We wish to obtain $(y_i)_i$ belonging to a two-dimensional space, where each $y_i$ represents a point in the plane corresponding to $x_i$.

We define the similarity between points $i$ and $j$ in this space:

$$q_{j \mid i} = \frac{\exp\left( - \| y_i - y_j \|^2 \right)}{\sum_{k \neq i} \exp\left( - \| y_i - y_k \|^2 \right)}.$$

We let $q_{i \mid i}:=0$.
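A matching sketch for the map similarities (again, the function name is ours). There is no $\sigma$ to tune here: the bandwidth in the map is fixed so that the exponent is simply $-\| y_i - y_j \|^2$:

```python
import numpy as np

def conditional_q(Y):
    """Q[i, j] = q_{j|i}: similarities between points of the low-dimensional map."""
    d2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    logits = -d2  # fixed bandwidth: the exponent is -||y_i - y_j||^2
    np.fill_diagonal(logits, -np.inf)  # enforces q_{i|i} = 0
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stabilized row softmax
    return e / e.sum(axis=1, keepdims=True)
```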

### What is a good map from $(x_i)_i$ to $(y_i)_i$?

A good map from $(x_i)_i$ to $(y_i)_i$ is a map for which distributions $P_i: j \mapsto p_{j \mid i}$ and $Q_i: j \mapsto q_{j \mid i}$ are equal (for all $i$).

We measure the distance between distributions $P_i$ and $Q_i$ with the Kullback-Leibler divergence:

$$\text{KL}(P_i \mid \mid Q_i) = \sum_{j} p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}}.$$

Since we want to minimize them for all $i$, we finally define the cost function as:

$$C = \sum_i \text{KL}(P_i \mid \mid Q_i) = \sum_i \sum_{j} p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}}.$$
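The cost takes a few lines of NumPy; a sketch, assuming `P[i, j]` holds $p_{j \mid i}$ and `Q[i, j]` holds $q_{j \mid i}$, both with zero diagonals:

```python
import numpy as np

def sne_cost(P, Q):
    """C = sum_i KL(P_i || Q_i) = sum_{i,j} p_{j|i} * log(p_{j|i} / q_{j|i})."""
    mask = P > 0  # convention 0 * log 0 = 0; this also skips the zero diagonal
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```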

### Asymmetry of $\text{KL}$

We recall that the $\text{KL}$ divergence is asymmetric (see Fig. 3.6 of the Deep Learning book, p. 74). In SNE, we focus on choosing a distribution $Q_i$ which fairly represents the whole distribution $P_i$, so we optimize $\text{KL}(P_i \mid \mid Q_i)$.

This means:

• If we use widely separated 2D points to represent nearby datapoints, there will be a large cost (i.e. we have a small $q_{j \mid i}$ to model a large $p_{j \mid i}$, i.e. the points are far away in $Y$ but close in $X$),
• If we use nearby 2D points to represent widely separated datapoints, there will be a small cost (i.e. we have a large $q_{j \mid i}$ to model a small $p_{j \mid i}$, i.e. the points are close in $Y$ but far away in $X$).

On the whole, “the SNE cost function focuses on retaining the local structure of the data in the map”.

The cost function is optimized with a variant of the classic gradient descent (see [1] for details).

## Derivation of the SNE gradient

Using the previous notations, we have:

$$C = \sum_i \sum_{j \neq i} p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}} = \sum_i \sum_{j \neq i} p_{j \mid i} \log p_{j \mid i} - \sum_i \sum_{j \neq i} p_{j \mid i} \log q_{j \mid i}.$$

The first term does not depend on the map, so, taking the gradient with respect to $y_l$, we want to compute:

$$\nabla_{y_l} C = - \sum_i \sum_{j \neq i} p_{j \mid i} \, \nabla_{y_l} \log q_{j \mid i} = - \sum_i \sum_{j \neq i} \frac{p_{j \mid i}}{q_{j \mid i}} \nabla_{y_l} q_{j \mid i}.$$

### Decomposition into 3 terms

We separate the calculation into a sum of 3 terms:

$$\nabla_{y_l} C = [\text{I}] + [\text{II}] + [\text{III}],$$

where $[\text{I}]$ gathers the terms with $i=l$; $[\text{II}]$ the terms with $j=l$; and $[\text{III}]$ the remainder:

$$[\text{I}] = - \sum_{j \neq l} \frac{p_{j \mid l}}{q_{j \mid l}} \nabla_{y_l} q_{j \mid l}, \quad [\text{II}] = - \sum_{i \neq l} \frac{p_{l \mid i}}{q_{l \mid i}} \nabla_{y_l} q_{l \mid i}, \quad [\text{III}] = - \sum_{i \neq l} \sum_{j \neq i, l} \frac{p_{j \mid i}}{q_{j \mid i}} \nabla_{y_l} q_{j \mid i}.$$

### Calculation of $q_{j \mid i}$ and $\nabla_{y_l} q_{j \mid i}$

We define for all $k, l$:

$$f_k(y_l) := \exp\left( - \| y_l - y_k \|^2 \right), \quad \text{so that} \quad q_{k \mid l} = \frac{f_k(y_l)}{\sum_{k' \neq l} f_{k'}(y_l)}.$$

We have, using $\nabla_{a} \| a - b \|^2 = 2(a-b)$:

$$\nabla_{y_l} f_k(y_l) = -2 (y_l - y_k) f_k(y_l).$$

Note also that $f_k(y_l) = f_l(y_k)$, so that for $i \neq l$, $\nabla_{y_l} f_l(y_i) = \nabla_{y_l} f_i(y_l) = -2 (y_l - y_i) f_i(y_l)$.

Case $[\text{I}]$ ($i = l$, $j \neq l$):

We have

$$q_{j \mid l} = \frac{f_j(y_l)}{S_l} \quad \text{with} \quad S_l := \sum_{k \neq l} f_k(y_l),$$

and, by the quotient rule,

$$\nabla_{y_l} q_{j \mid l} = \frac{-2 (y_l - y_j) f_j(y_l) \, S_l + f_j(y_l) \sum_{k \neq l} 2 (y_l - y_k) f_k(y_l)}{S_l^2}.$$

Case $[\text{II}]$ ($j = l$, $i \neq l$):

We write $S := \sum_{k \neq i} f_k(y_i) = f_i(y_l) + B$, where $B := \sum_{k \neq i, l} f_k(y_i)$ does not depend on $y_l$ (using $f_l(y_i) = f_i(y_l)$). We have

$$q_{l \mid i} = \frac{f_i(y_l)}{f_i(y_l) + B}$$

and

$$\nabla_{y_l} q_{l \mid i} = \frac{\nabla_{y_l} f_i(y_l) \left( f_i(y_l) + B \right) - f_i(y_l) \, \nabla_{y_l} f_i(y_l)}{S^2} = \frac{-2 (y_l - y_i) f_i(y_l) \, B}{S^2}.$$

Case $[\text{III}]$ ($i \neq l$, $j \neq i, l$):

With the same notation $S = f_i(y_l) + B$, only the denominator of $q_{j \mid i}$ depends on $y_l$. We have

$$q_{j \mid i} = \frac{f_j(y_i)}{S}$$

and

$$\nabla_{y_l} q_{j \mid i} = - \frac{f_j(y_i) \, \nabla_{y_l} f_i(y_l)}{S^2} = \frac{2 (y_l - y_i) f_i(y_l) f_j(y_i)}{S^2}.$$

### Calculation of $[\text{I}]$

Plugging $\nabla_{y_l} q_{j \mid l}$ into $[\text{I}]$:

$$[\text{I}] = - \sum_{j \neq l} \frac{p_{j \mid l}}{q_{j \mid l}} \nabla_{y_l} q_{j \mid l} = \sum_{j \neq l} p_{j \mid l} \left[ 2 (y_l - y_j) - \sum_{k \neq l} 2 (y_l - y_k) \frac{f_k(y_l)}{S_l} \right].$$

But $q_{k \mid l} = \frac{f_k(y_l)}{\sum_{k' \neq l} f_{k'}(y_l)}$, so:

$$[\text{I}] = 2 \sum_{j \neq l} p_{j \mid l} (y_l - y_j) - 2 \sum_{j \neq l} p_{j \mid l} \sum_{k \neq l} q_{k \mid l} (y_l - y_k),$$

and $\sum_{j \neq l} p_{j \mid l} = 1$, then:

$$[\text{I}] = 2 \sum_{j \neq l} p_{j \mid l} (y_l - y_j) - 2 \sum_{k \neq l} q_{k \mid l} (y_l - y_k) = 2 \sum_{j \neq l} \left( p_{j \mid l} - q_{j \mid l} \right) (y_l - y_j).$$

### Calculation of $[\text{II}]$

Since $S = f_i(y_l) + B$, we have:

$$[\text{II}] = - \sum_{i \neq l} \frac{p_{l \mid i}}{q_{l \mid i}} \nabla_{y_l} q_{l \mid i} = \sum_{i \neq l} p_{l \mid i} \, \frac{S}{f_i(y_l)} \cdot \frac{2 (y_l - y_i) f_i(y_l) \, B}{S^2} = 2 \sum_{i \neq l} p_{l \mid i} (y_l - y_i) \, \frac{S - f_i(y_l)}{S}.$$

But $q_{l \mid i} = \frac{f_i(y_l)}{S}$, so:

$$[\text{II}] = 2 \sum_{i \neq l} p_{l \mid i} \left( 1 - q_{l \mid i} \right) (y_l - y_i).$$

### Calculation of $[\text{III}]$

Again with $S = f_i(y_l) + B$, we have:

$$[\text{III}] = - \sum_{i \neq l} \sum_{j \neq i, l} \frac{p_{j \mid i}}{q_{j \mid i}} \cdot \frac{2 (y_l - y_i) f_i(y_l) f_j(y_i)}{S^2} = - 2 \sum_{i \neq l} (y_l - y_i) \, \frac{f_i(y_l)}{S} \sum_{j \neq i, l} p_{j \mid i}.$$

But $q_{l \mid i} = \frac{f_i(y_l)}{S}$, so:

$$[\text{III}] = - 2 \sum_{i \neq l} q_{l \mid i} (y_l - y_i) \sum_{j \neq i, l} p_{j \mid i}.$$

We have $\sum_{j \neq i, l} p_{j \mid i} = 1 - p_{l \mid i}$, so finally:

$$[\text{III}] = - 2 \sum_{i \neq l} q_{l \mid i} \left( 1 - p_{l \mid i} \right) (y_l - y_i).$$

### Final form of the gradient

So after simplification:

$$[\text{II}] + [\text{III}] = 2 \sum_{i \neq l} \left( p_{l \mid i} (1 - q_{l \mid i}) - q_{l \mid i} (1 - p_{l \mid i}) \right) (y_l - y_i) = 2 \sum_{i \neq l} \left( p_{l \mid i} - q_{l \mid i} \right) (y_l - y_i),$$

and finally:

$$\nabla_{y_l} C = 2 \sum_{j \neq l} \left( p_{j \mid l} - q_{j \mid l} + p_{l \mid j} - q_{l \mid j} \right) (y_l - y_j).$$

The form of this gradient is interpreted in Section 2 of [1].
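As a sanity check of this final form, the analytic gradient can be compared against finite differences of the cost. The following self-contained sketch (all helper names are ours) builds the $p_{j \mid i}$ and $q_{j \mid i}$ matrices, evaluates $\nabla_{y_l} C$ with the formula above, and checks it numerically:

```python
import numpy as np

def cond_probs(Z, sigmas=None):
    """Row-wise conditional similarities: softmax of -||z_i - z_j||^2 (/ 2 sigma_i^2)."""
    d2 = np.square(Z[:, None, :] - Z[None, :, :]).sum(-1)
    if sigmas is not None:
        d2 = d2 / (2.0 * np.square(sigmas)[:, None])
    logits = -d2
    np.fill_diagonal(logits, -np.inf)  # p_{i|i} = q_{i|i} = 0
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sne_cost(P, Y):
    """C = sum_i KL(P_i || Q_i) for the map Y."""
    Q = cond_probs(Y)
    mask = P > 0  # convention 0 * log 0 = 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def sne_grad(P, Y):
    """grad_{y_l} C = 2 * sum_j (p_{j|l} - q_{j|l} + p_{l|j} - q_{l|j}) (y_l - y_j)."""
    Q = cond_probs(Y)
    W = (P - Q) + (P - Q).T  # W[l, j] = p_{j|l} - q_{j|l} + p_{l|j} - q_{l|j}
    # sum_j W[l, j] (y_l - y_j) = (row sum) * y_l - (W @ Y)[l]
    return 2.0 * (W.sum(axis=1)[:, None] * Y - W @ Y)

# Central finite-difference check of the analytic gradient.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # high-dimensional points
Y = rng.normal(size=(6, 2))   # candidate 2D map
P = cond_probs(X, sigmas=np.ones(6))
G = sne_grad(P, Y)
G_num = np.zeros_like(Y)
eps = 1e-6
for l in range(Y.shape[0]):
    for d in range(Y.shape[1]):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[l, d] += eps
        Ym[l, d] -= eps
        G_num[l, d] = (sne_cost(P, Yp) - sne_cost(P, Ym)) / (2 * eps)
assert np.allclose(G, G_num, atol=1e-5)
```

If the final formula had a wrong sign or a missing term, the assertion would fail, which makes this a convenient way to validate the derivation.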

## References

• [1]: Visualizing Data using t-SNE (2008) by Laurens van der Maaten and Geoffrey Hinton,
• [2]: Stochastic Neighbor Embedding (2002) by Geoffrey Hinton and Sam Roweis.
Written on May 11, 2017