# Computation of the gradient for SNE

Many methods exist to visualize high-dimensional data through a two-dimensional map. These include linear techniques such as PCA and MDS, as well as nonlinear ones such as Isomap, LLE, SNE and t-SNE (resp. Principal Component Analysis, 1933; MultiDimensional Scaling, 1952; Isomap, 2000; Locally Linear Embedding, 2000; Stochastic Neighbor Embedding, 2002; t-Distributed Stochastic Neighbor Embedding, 2008). Some of these dimensionality reduction methods are illustrated in this sklearn example.

A popular method for nonlinear mapping is t-SNE. This method was developed by Laurens van der Maaten and Geoffrey Hinton in [1]. It is based on SNE and improves it by addressing the “crowding problem” (the tendency of mapped points to aggregate into a single central cluster). You can familiarize yourself with t-SNE here, which allows you to explore various examples interactively.

In this post, we propose to derive the gradient of the SNE algorithm (not to be confused with t-SNE, for which the gradient calculation is detailed in Appendix A of [1]). The SNE gradient is given in both the original SNE article and the t-SNE article, but detailed in neither (see Equation 5 of [2] and Section 2 of [1]).

In the following, we describe how SNE works (which is essentially a rewriting of Section 2 of [1]) before deriving the SNE gradient step by step.

## How SNE works

### High-dimensional points

We have $N$ points $X = (x_1, \ldots, x_N)$ in a high-dimensional space $\mathbb{R}^D$. We can measure the distance between two points $x_i$ and $x_j$ with $\|x_i - x_j\|$.

We define the similarity between points $x_i$ and $x_j$ as a conditional probability $p_{j|i}$:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}.$$

$p_{j|i}$ represents the probability “that $x_i$ would pick $x_j$ as its neighbor, if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$”. We let $p_{i|i} = 0$.

The method for determining $\sigma_i$ is presented in [1] and is related to the *perplexity* parameter.
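As an illustration, here is a minimal NumPy sketch of these conditional probabilities. It assumes a single hand-picked bandwidth `sigma` shared by all points; a real implementation would instead tune each $\sigma_i$ from the perplexity, as described in [1].

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    """Conditional probabilities p_{j|i} for data points X of shape (N, D).

    A single bandwidth sigma is used for every point (simplifying assumption);
    SNE actually tunes one sigma_i per point from the perplexity.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # ||x_i - x_j||^2
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)           # enforces p_{i|i} = 0
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)     # normalize each row over k != i
```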

### Low-dimensional mapping

We wish to obtain points $Y = (y_1, \ldots, y_N)$ belonging to a two-dimensional space, where each $y_i$ represents a point in the plane corresponding to $x_i$.

We define the similarity between points $y_i$ and $y_j$ in this space as a conditional probability $q_{j|i}$:

$$q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^2\right)}.$$

We let $q_{i|i} = 0$.
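The corresponding sketch for the map similarities has the same structure as `conditional_p` above, with the bandwidth fixed so that the exponent is simply $-\|y_i - y_j\|^2$:

```python
import numpy as np

def conditional_q(Y):
    """Conditional probabilities q_{j|i} for map points Y of shape (N, 2)."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # ||y_i - y_j||^2
    logits = -sq_dists
    np.fill_diagonal(logits, -np.inf)           # enforces q_{i|i} = 0
    Q = np.exp(logits)
    return Q / Q.sum(axis=1, keepdims=True)
```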

### What is a good map from $X$ to $Y$?

A good map from $X$ to $Y$ is a map for which the distributions $P_i := (p_{j|i})_j$ and $Q_i := (q_{j|i})_j$ are equal (for all $i$).

We measure the distance between the distributions $P_i$ and $Q_i$ with the Kullback-Leibler divergence:

$$KL(P_i \| Q_i) = \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}.$$

[See this post for more details].

Since we want to minimize these divergences for all $i$, we finally define the cost function as:

$$C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}.$$
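Using the two hypothetical helpers sketched above, the cost can be computed as follows (terms with $p_{j|i} = 0$ are skipped, since they contribute nothing to the sum):

```python
import numpy as np

def sne_cost(P, Q):
    """Sum over i of KL(P_i || Q_i), i.e. sum_{i,j} p_{j|i} * log(p_{j|i} / q_{j|i})."""
    mask = P > 0                          # skip p_{j|i} = 0 terms (in particular j = i)
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```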

### Asymmetry of the Kullback-Leibler divergence

We remember that the Kullback-Leibler divergence is asymmetric: in general $KL(P \| Q) \neq KL(Q \| P)$ (see Fig. 3.6 of the Deep Learning book, p. 74). In SNE, we focus on choosing a distribution $Q_i$ which fairly represents the whole distribution $P_i$, so we optimize $KL(P_i \| Q_i)$ rather than $KL(Q_i \| P_i)$.

This means (see the small numerical illustration below):

- If we use widely separated 2D points to represent nearby datapoints, there will be a large cost (i.e. we have a small $q_{j|i}$ to model a large $p_{j|i}$, i.e. the points are far away in $Y$ but close in $X$),
- If we use nearby 2D points to represent widely separated datapoints, there will be a small cost (i.e. we have a large $q_{j|i}$ to model a small $p_{j|i}$, i.e. the points are close in $Y$ but far away in $X$).

On the whole, “the SNE cost function focuses on retaining the local structure of the data in the map”.
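Here is a tiny numeric check of these two situations, looking at a single term $p \log(p/q)$ of the cost (made-up values, not produced by SNE; note that an individual term can be negative even though each full KL divergence is non-negative):

```python
import numpy as np

def term(p, q):
    """One term p_{j|i} * log(p_{j|i} / q_{j|i}) of the cost."""
    return p * np.log(p / q)

# Nearby datapoints (large p) mapped far apart (small q): large positive cost term.
print(term(0.8, 0.05))   # ~2.22
# Distant datapoints (small p) mapped close together (large q): small cost term.
print(term(0.05, 0.8))   # ~-0.14
```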

### Gradient descent

The cost function is optimized with a variant of classic gradient descent (see [1] for details).
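For concreteness, a bare-bones gradient-descent loop with a momentum term might look like the sketch below. The learning rate, momentum and number of iterations are arbitrary placeholders rather than the actual schedule of [1], and `sne_grad` is the gradient function derived at the end of this post.

```python
import numpy as np

def optimize(Y0, P, lr=10.0, momentum=0.5, n_iter=500):
    """Gradient descent with momentum on the map points Y (shape (N, 2))."""
    Y = Y0.copy()
    velocity = np.zeros_like(Y)
    for _ in range(n_iter):
        grad = sne_grad(P, Y)                   # dC/dy_i for every i, shape (N, 2)
        velocity = momentum * velocity - lr * grad
        Y = Y + velocity
    return Y
```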

## Derivation of the SNE gradient

Using the previous notations, we have:

$$C = \sum_k \sum_l p_{l|k} \log \frac{p_{l|k}}{q_{l|k}} = \sum_k \sum_l p_{l|k} \log p_{l|k} - \sum_k \sum_l p_{l|k} \log q_{l|k}.$$

Taking the gradient with respect to a map point $y_i$, and noting that the terms $p_{l|k} \log p_{l|k}$ do not depend on $y_i$, we want to compute:

$$\frac{\partial C}{\partial y_i} = -\sum_k \sum_l p_{l|k} \frac{\partial \log q_{l|k}}{\partial y_i}.$$

### Decomposition into 3 terms

We separate the calculation as a sum of 3 terms:

$$\frac{\partial C}{\partial y_i} = A + B + E,$$

where $A$ gathers the terms for which $k = i$; $B$ the terms for which $l = i$; and $E$ is the remainder term (terms for which $k \neq i$ and $l \neq i$):

$$A = -\sum_{l \neq i} p_{l|i} \frac{\partial \log q_{l|i}}{\partial y_i}, \qquad B = -\sum_{k \neq i} p_{i|k} \frac{\partial \log q_{i|k}}{\partial y_i}, \qquad E = -\sum_{k \neq i} \sum_{\substack{l \neq i \\ l \neq k}} p_{l|k} \frac{\partial \log q_{l|k}}{\partial y_i}.$$

**Calculation of $\frac{\partial}{\partial y_i} \|y_k - y_l\|^2$ and $\frac{\partial S_k}{\partial y_i}$**

We define for all $k \neq l$:

$$E_{kl} := \exp\left(-\|y_k - y_l\|^2\right) \qquad \text{and} \qquad S_k := \sum_{m \neq k} E_{km}, \qquad \text{so that} \qquad q_{l|k} = \frac{E_{kl}}{S_k}.$$

We have, using $\log q_{l|k} = -\|y_k - y_l\|^2 - \log S_k$:

$$\frac{\partial \log q_{l|k}}{\partial y_i} = -\frac{\partial}{\partial y_i} \|y_k - y_l\|^2 - \frac{1}{S_k} \frac{\partial S_k}{\partial y_i}.$$

We now compute these two derivatives in the three cases needed for $A$, $B$ and $E$.

*Case $k = i$ (and $l \neq i$):*

We have

$$\frac{\partial}{\partial y_i} \|y_i - y_l\|^2 = 2\,(y_i - y_l)$$

and

$$\frac{1}{S_i} \frac{\partial S_i}{\partial y_i} = \frac{1}{S_i} \sum_{m \neq i} \frac{\partial E_{im}}{\partial y_i} = \frac{1}{S_i} \sum_{m \neq i} \left(-2\,(y_i - y_m)\, E_{im}\right) = -2 \sum_{m \neq i} (y_i - y_m)\, q_{m|i}.$$

*Case $l = i$ (and $k \neq i$):*

We have

$$\frac{\partial}{\partial y_i} \|y_k - y_i\|^2 = 2\,(y_i - y_k)$$

and, since $E_{ki}$ is the only term of $S_k$ depending on $y_i$,

$$\frac{1}{S_k} \frac{\partial S_k}{\partial y_i} = \frac{1}{S_k} \frac{\partial E_{ki}}{\partial y_i} = \frac{1}{S_k} \left(-2\,(y_i - y_k)\, E_{ki}\right) = -2\,(y_i - y_k)\, q_{i|k}.$$

*Case $k \neq i$ and $l \neq i$:*

We have

$$\frac{\partial}{\partial y_i} \|y_k - y_l\|^2 = 0$$

and, exactly as in the previous case,

$$\frac{1}{S_k} \frac{\partial S_k}{\partial y_i} = -2\,(y_i - y_k)\, q_{i|k}.$$

### Calculation of $A$

Using the case $k = i$:

$$A = -\sum_{l \neq i} p_{l|i} \frac{\partial \log q_{l|i}}{\partial y_i} = 2 \sum_{l \neq i} p_{l|i}\, (y_i - y_l) - 2 \left(\sum_{l \neq i} p_{l|i}\right) \sum_{m \neq i} (y_i - y_m)\, q_{m|i}.$$

But $\sum_{l \neq i} p_{l|i} = 1$, so:

$$A = 2 \sum_{l \neq i} p_{l|i}\, (y_i - y_l) - 2 \sum_{m \neq i} q_{m|i}\, (y_i - y_m),$$

and, renaming the summation index $m$ as $l$, then:

$$A = 2 \sum_{l \neq i} \left(p_{l|i} - q_{l|i}\right) (y_i - y_l).$$

### Calculation of $B$

Since, for $k \neq i$ (case $l = i$), $\frac{\partial \log q_{i|k}}{\partial y_i} = -2\,(y_i - y_k) + 2\,(y_i - y_k)\, q_{i|k}$, we have:

$$B = -\sum_{k \neq i} p_{i|k} \left[-2\,(y_i - y_k) + 2\,(y_i - y_k)\, q_{i|k}\right] = 2 \sum_{k \neq i} p_{i|k}\, (1 - q_{i|k})\, (y_i - y_k).$$

But $p_{i|k}\, (1 - q_{i|k}) = p_{i|k} - p_{i|k}\, q_{i|k}$, so:

$$B = 2 \sum_{k \neq i} \left(p_{i|k} - p_{i|k}\, q_{i|k}\right) (y_i - y_k).$$

### Calculation of $E$

Since, for $k \neq i$ and $l \neq i$ (last case), $\frac{\partial \log q_{l|k}}{\partial y_i} = 2\,(y_i - y_k)\, q_{i|k}$, we have:

$$E = -\sum_{k \neq i} \sum_{\substack{l \neq i \\ l \neq k}} p_{l|k} \cdot 2\,(y_i - y_k)\, q_{i|k} = -2 \sum_{k \neq i} (y_i - y_k)\, q_{i|k} \sum_{\substack{l \neq i \\ l \neq k}} p_{l|k}.$$

But $\sum_{l \neq k} p_{l|k} = 1$, so $\sum_{\substack{l \neq i \\ l \neq k}} p_{l|k} = 1 - p_{i|k}$:

$$E = -2 \sum_{k \neq i} (1 - p_{i|k})\, q_{i|k}\, (y_i - y_k).$$

We have: $(1 - p_{i|k})\, q_{i|k} = q_{i|k} - p_{i|k}\, q_{i|k}$, so finally:

$$E = -2 \sum_{k \neq i} \left(q_{i|k} - p_{i|k}\, q_{i|k}\right) (y_i - y_k).$$

### Final form of the gradient

Adding $B$ and $E$, the terms $p_{i|k}\, q_{i|k}$ cancel out, so after simplification:

$$B + E = 2 \sum_{k \neq i} \left(p_{i|k} - q_{i|k}\right) (y_i - y_k),$$

and finally, renaming the summation indices as $j$:

$$\frac{\partial C}{\partial y_i} = A + B + E = 2 \sum_{j \neq i} \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right) (y_i - y_j).$$

The form of this gradient is interpreted in Section 2 of [1].
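To sanity-check this result, here is a sketch of the closed-form gradient together with a finite-difference comparison against the cost function, reusing the hypothetical helpers `conditional_p`, `conditional_q` and `sne_cost` sketched earlier in this post.

```python
import numpy as np

def sne_grad(P, Y):
    """dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)."""
    Q = conditional_q(Y)
    M = (P - Q) + (P - Q).T                     # M[i, j] = p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}
    diffs = Y[:, None, :] - Y[None, :, :]       # diffs[i, j] = y_i - y_j, shape (N, N, 2)
    return 2.0 * np.sum(M[:, :, None] * diffs, axis=1)

# Finite-difference check on small random data.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(6, 5)), rng.normal(size=(6, 2))
P = conditional_p(X, sigma=1.0)

num = np.zeros_like(Y)
h = 1e-6
for i in range(Y.shape[0]):
    for d in range(Y.shape[1]):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[i, d] += h
        Ym[i, d] -= h
        num[i, d] = (sne_cost(P, conditional_q(Yp)) - sne_cost(P, conditional_q(Ym))) / (2 * h)

print(np.max(np.abs(num - sne_grad(P, Y))))    # should be tiny (around 1e-8 or below)
```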