Computation of the gradient for SNE

Many methods exist to visualize high-dimensional data through a two-dimensional map. These include linear techniques such as PCA and MDS, as well as nonlinear ones such as Isomap, LLE, SNE and t-SNE (respectively: Principal Component Analysis, 1933; MultiDimensional Scaling, 1952; Isomap, 2000; Locally Linear Embedding, 2000; Stochastic Neighbor Embedding, 2002; t-Distributed Stochastic Neighbor Embedding, 2008). Some of these dimensionality reduction methods are illustrated in this sklearn example.

A popular method for nonlinear mapping is t-SNE. This method was developed by Laurens van der Maaten and Geoffrey Hinton in [1]. It is based on SNE and improves it by addressing the “crowding problem” (the tendency of mapped points to aggregate into a single central cluster). You can familiarize yourself with t-SNE here, which allows interactive exploration of various examples.

In this post, we propose to derive the gradient of the SNE algorithm (not to be confused with t-SNE, for which the gradient calculation is detailed in Appendix A of [1]). The SNE gradient is given in both the original article and the t-SNE article, but detailed in neither (see Equation 5 of [2] and Section 2 of [1]).

[Illustration: the SNE cost function and its gradient]

In the following, we describe how SNE works (which is essentially a rewriting of Section 2 of [1]) before deriving the SNE gradient step by step.

How SNE works

High-dimensional points

We have points $x_1, \ldots, x_N$ in a high-dimensional space $\mathbb{R}^D$. We can measure the distance between two points $x_i$ and $x_j$ with the Euclidean norm $\lVert x_i - x_j \rVert$.

We define the similarity between points $x_i$ and $x_j$ as a conditional probability:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}.$$

$p_{j|i}$ represents the probability “that $x_i$ would pick $x_j$ as its neighbor, if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$”. We let $p_{i|i} := 0$.

The method for determining the bandwidths $\sigma_i$ is presented in [1] and is related to the perplexity parameter.
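As a concrete illustration, here is a minimal NumPy sketch of the conditional probabilities $p_{j|i}$ for bandwidths $\sigma_i$ fixed by hand (the perplexity-based search of [1] is omitted, and the function and variable names are ours):

```python
import numpy as np

# Minimal sketch of the conditional probabilities p_{j|i} for fixed
# bandwidths sigma_i (the perplexity-based binary search of [1] is
# omitted; function and variable names are ours).
def conditional_p(X, sigmas):
    # squared Euclidean distances ||x_i - x_j||^2
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)                 # convention p_{i|i} = 0
    return P / P.sum(axis=1, keepdims=True)  # normalize each row i over j

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))                 # 5 points in R^10
P = conditional_p(X, np.ones(5))             # row i is the distribution (p_{j|i})_j
```

Each row of `P` sums to one, since $(p_{j|i})_j$ is a probability distribution for every $i$.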

Low-dimensional mapping

We wish to obtain points $y_1, \ldots, y_N$ belonging to a two-dimensional space, where each $y_i$ represents in the plane the point $x_i$.

We define the similarity between points $y_i$ and $y_j$ in this space:

$$q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}.$$

We let $q_{i|i} := 0$.
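The low-dimensional similarities can be sketched the same way; here the Gaussian bandwidth is fixed so that the kernel is simply $\exp(-\lVert \cdot \rVert^2)$ (function name is ours):

```python
import numpy as np

# Minimal sketch of the low-dimensional similarities q_{j|i}; unlike
# p_{j|i}, the Gaussian bandwidth is fixed so the kernel is exp(-||.||^2).
def conditional_q(Y):
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)                 # convention q_{i|i} = 0
    return Q / Q.sum(axis=1, keepdims=True)

Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Q = conditional_q(Y)
# y_1 is closer to y_0 than y_2 is, so q_{1|0} > q_{2|0}
```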

What is a good map from $(x_i)_i$ to $(y_i)_i$?

A good map from $(x_i)_i$ to $(y_i)_i$ is a map for which the distributions $P_i := (p_{j|i})_j$ and $Q_i := (q_{j|i})_j$ are equal (for all $i$).

We measure the distance between the distributions $P_i$ and $Q_i$ with the Kullback-Leibler divergence:

$$\mathrm{KL}(P_i \,\|\, Q_i) = \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}.$$

[See this post for more details].

Since we want to minimize them for all $i$, we finally define the cost function as:

$$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}.$$
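A minimal sketch of this cost, where rows of `P` and `Q` play the role of the distributions $P_i$ and $Q_i$ (the example matrices are arbitrary):

```python
import numpy as np

# Minimal sketch of the cost C = sum_i KL(P_i || Q_i); rows of P and Q
# are the conditional distributions P_i and Q_i. Entries where p = 0
# (e.g. the diagonal) contribute 0 to the sum, hence the mask.
def sne_cost(P, Q):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
Q = np.array([[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.1],
              [0.1, 0.9, 0.0]])
# the cost is zero iff Q_i = P_i for all i, and positive otherwise
```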

Asymmetry of the Kullback-Leibler divergence

We remember that the Kullback-Leibler divergence is asymmetric (see Fig. 3.6 of the Deep Learning book, p. 74). In SNE, we focus on choosing a distribution $Q_i$ which fairly represents the whole distribution $P_i$, so we optimize $\mathrm{KL}(P_i \,\|\, Q_i)$ rather than $\mathrm{KL}(Q_i \,\|\, P_i)$.

This means:

  • If we use widely separated 2D points to represent nearby datapoints, there will be a large cost (i.e. we have a small $q_{j|i}$ to model a large $p_{j|i}$, i.e. the points are far away in $\mathbb{R}^2$ but close in $\mathbb{R}^D$),
  • If we use nearby 2D points to represent widely separated datapoints, there will be a small cost (i.e. we have a large $q_{j|i}$ to model a small $p_{j|i}$, i.e. the points are close in $\mathbb{R}^2$ but far away in $\mathbb{R}^D$).

On the whole, “the SNE cost function focuses on retaining the local structure of the data in the map”.
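The asymmetry can be checked numerically; a tiny sketch with two arbitrary toy distributions:

```python
import numpy as np

# Numerical check that KL is asymmetric: modelling a spread-out p with a
# concentrated q costs more than the reverse (toy distributions, chosen
# only for illustration).
def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.5])   # two equally likely neighbors
q = np.array([0.9, 0.1])   # model putting almost all mass on one neighbor
kl_pq = kl(p, q)           # small q modelling a large p: heavily penalized
kl_qp = kl(q, p)           # the reverse direction is cheaper
```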

Gradient descent

The cost function $C$ is optimized with a variant of classic gradient descent (see [1] for details).
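Schematically, one such update step with a momentum term might look as follows (the learning-rate and momentum values are placeholders, not those of [1], and the momentum schedule and jitter of [1] are omitted):

```python
import numpy as np

# Schematic momentum update for minimizing the cost (hypothetical values;
# [1] uses a momentum schedule and adds jitter, both omitted here).
def gd_step(Y, grad, velocity, lr=0.1, momentum=0.5):
    velocity = momentum * velocity - lr * grad   # accumulate descent direction
    return Y + velocity, velocity

Y = np.zeros((3, 2))                 # initial 2D map
v = np.zeros_like(Y)
g = np.ones_like(Y)                  # a dummy gradient
Y, v = gd_step(Y, g, v)              # each entry moves by -lr * g = -0.1
```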

Derivation of the SNE gradient

We have, using the previous notations:

$$C = \sum_k \mathrm{KL}(P_k \,\|\, Q_k) = \sum_k \sum_{l \neq k} p_{l|k} \log \frac{p_{l|k}}{q_{l|k}}.$$

Taking the gradient with respect to a mapped point $y_i$, we want to compute:

$$\frac{\partial C}{\partial y_i}.$$

Decomposition into 3 terms

We separate the calculation as a sum of 3 terms:

$$C = C_1 + C_2 + C_3,$$

where $C_1$ gathers the terms with $k = i$; $C_2$ the terms with $l = i$; and $C_3$ is the remainder term:

$$C_1 = \sum_{l \neq i} p_{l|i} \log \frac{p_{l|i}}{q_{l|i}}, \qquad C_2 = \sum_{k \neq i} p_{i|k} \log \frac{p_{i|k}}{q_{i|k}}, \qquad C_3 = \sum_{k \neq i} \sum_{l \neq i,\, l \neq k} p_{l|k} \log \frac{p_{l|k}}{q_{l|k}}.$$

Calculation of $E_{kl}$ and $\partial E_{kl} / \partial y_i$

We define for all $k \neq l$:

$$E_{kl} := \exp\left(-\lVert y_k - y_l \rVert^2\right), \qquad \text{so that} \qquad q_{l|k} = \frac{E_{kl}}{\sum_{m \neq k} E_{km}}.$$

We have, using $\frac{\partial}{\partial y} \lVert y - a \rVert^2 = 2(y - a)$:

Case $k = i$ (and $l \neq i$):

We have

$$\frac{\partial E_{il}}{\partial y_i} = -2(y_i - y_l)\, E_{il}.$$

Case $l = i$ (and $k \neq i$):

We have

$$\frac{\partial E_{ki}}{\partial y_i} = -2(y_i - y_k)\, E_{ki}.$$

Case $k \neq i$ and $l \neq i$:

We have

$$\frac{\partial E_{kl}}{\partial y_i} = 0.$$

Calculation of $\partial C_1 / \partial y_i$

Since only the $q_{l|i}$ depend on $y_i$ in $C_1$, with $\log q_{l|i} = \log E_{il} - \log \sum_{m \neq i} E_{im}$, we have:

$$\frac{\partial C_1}{\partial y_i} = -\sum_{l \neq i} p_{l|i} \left[ \frac{1}{E_{il}} \frac{\partial E_{il}}{\partial y_i} - \frac{\sum_{m \neq i} \partial E_{im} / \partial y_i}{\sum_{m \neq i} E_{im}} \right].$$

But $\frac{\partial E_{il}}{\partial y_i} = -2(y_i - y_l)\, E_{il}$, so:

$$\frac{\partial C_1}{\partial y_i} = 2 \sum_{l \neq i} p_{l|i} (y_i - y_l) - 2 \left( \sum_{l \neq i} p_{l|i} \right) \sum_{m \neq i} \frac{E_{im}}{\sum_{m' \neq i} E_{im'}} (y_i - y_m),$$

and $\sum_{l \neq i} p_{l|i} = 1$ with $q_{m|i} = E_{im} / \sum_{m' \neq i} E_{im'}$, then:

$$\frac{\partial C_1}{\partial y_i} = 2 \sum_{j} \left( p_{j|i} - q_{j|i} \right) (y_i - y_j).$$

Calculation of $\partial C_2 / \partial y_i$

Since $\log q_{i|k} = \log E_{ki} - \log \sum_{m \neq k} E_{km}$, and only the term $m = i$ of the sum depends on $y_i$, we have:

$$\frac{\partial C_2}{\partial y_i} = -\sum_{k \neq i} p_{i|k} \left[ \frac{1}{E_{ki}} \frac{\partial E_{ki}}{\partial y_i} - \frac{\partial E_{ki} / \partial y_i}{\sum_{m \neq k} E_{km}} \right].$$

But $\frac{\partial E_{ki}}{\partial y_i} = -2(y_i - y_k)\, E_{ki}$, so:

$$\frac{\partial C_2}{\partial y_i} = 2 \sum_{k \neq i} p_{i|k} \left( 1 - q_{i|k} \right) (y_i - y_k).$$

Calculation of $\partial C_3 / \partial y_i$

Since, for $k \neq i$ and $l \neq i$, $y_i$ only appears in $q_{l|k}$ through the term $E_{ki}$ of the normalization, we have:

$$\frac{\partial C_3}{\partial y_i} = \sum_{k \neq i} \sum_{l \neq i,\, l \neq k} p_{l|k}\, \frac{\partial E_{ki} / \partial y_i}{\sum_{m \neq k} E_{km}}.$$

But $\frac{\partial E_{ki}}{\partial y_i} = -2(y_i - y_k)\, E_{ki}$, so:

$$\frac{\partial C_3}{\partial y_i} = -2 \sum_{k \neq i} q_{i|k} (y_i - y_k) \sum_{l \neq i,\, l \neq k} p_{l|k}.$$

We have: $\sum_{l \neq i,\, l \neq k} p_{l|k} = 1 - p_{i|k}$, so finally:

$$\frac{\partial C_3}{\partial y_i} = -2 \sum_{k \neq i} q_{i|k} \left( 1 - p_{i|k} \right) (y_i - y_k).$$

Final form of the gradient

So after simplification:

$$\frac{\partial C_2}{\partial y_i} + \frac{\partial C_3}{\partial y_i} = 2 \sum_{k \neq i} \left( p_{i|k} - q_{i|k} \right) (y_i - y_k),$$

and finally:

$$\frac{\partial C}{\partial y_i} = 2 \sum_{j} \left( p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j} \right) (y_i - y_j).$$

The form of this gradient is interpreted in Section 2 of [1].
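As a sanity check, this final gradient can be implemented and compared against a finite-difference approximation of the cost. A minimal NumPy sketch (function names are ours, the $\sigma_i$ are fixed to 1 instead of being tuned by perplexity, and `P[i, j]` stands for $p_{j|i}$):

```python
import numpy as np

# Sketch of the final gradient
#   dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j),
# with conventions P[i, j] = p_{j|i} and Q[i, j] = q_{j|i}.
def conditional_q(Y):
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)

def cost(P, Y):
    Q = conditional_q(Y)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def grad(P, Y):
    Q = conditional_q(Y)
    M = P - Q + P.T - Q.T            # M[i, j] = p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}
    # row i of the result is 2 * sum_j M[i, j] * (y_i - y_j)
    return 2.0 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 5))
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
P = np.exp(-D / 2.0)                 # sigma_i = 1 (no perplexity search)
np.fill_diagonal(P, 0.0)
P /= P.sum(axis=1, keepdims=True)

Y = rng.normal(size=(4, 2))
G = grad(P, Y)                       # analytic gradient at the map Y
```

Perturbing one coordinate of `Y` by a small `eps` and differencing `cost` reproduces the corresponding entry of `G`, which confirms the derivation above.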


  • [1]: Visualizing Data using t-SNE (2008), by Laurens van der Maaten and Geoffrey Hinton;
  • [2]: Stochastic Neighbor Embedding (2002), by Geoffrey Hinton and Sam Roweis.
Written on May 11, 2017