Computation of the gradient for SNE
Many methods exist to visualize high-dimensional data through a two-dimensional map. These include linear techniques such as PCA (Principal Component Analysis, 1933) and MDS (Multidimensional Scaling, 1952), as well as nonlinear ones such as Isomap (2000), LLE (Locally Linear Embedding, 2000), SNE (Stochastic Neighbor Embedding, 2002) and t-SNE (t-distributed Stochastic Neighbor Embedding, 2008). Some of these dimensionality reduction methods are illustrated in this sklearn example.
A popular method for nonlinear mapping is t-SNE. This method was developed by Laurens van der Maaten and Geoffrey Hinton in 2008. It is based on SNE and improves on it by addressing the "crowding problem" (the tendency of mapped points to aggregate into a single central cluster). You can familiarize yourself with t-SNE here, which allows interactive exploration of various examples.
In this post, we propose to derive the gradient of the SNE algorithm (not to be confused with t-SNE, for which the gradient calculation is detailed in Appendix A of the t-SNE article). The SNE gradient is given in both the original article and the t-SNE article, but detailed in neither (see Equation 5 of the t-SNE article and Section 2 of the original SNE article).
In the following, we describe how SNE works (essentially a rewriting of Section 2 of the t-SNE article) before deriving the SNE gradient step by step.
How SNE works
We have points $x_1, \dots, x_n$ in a high-dimensional space. We can measure the distance between two points $x_i$ and $x_j$ with $\|x_i - x_j\|$.
We define the similarity between points $x_i$ and $x_j$ as a conditional probability:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}.$$

$p_{j|i}$ represents the probability "that $x_i$ would pick $x_j$ as its neighbor, if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$". We let $p_{i|i} = 0$.
The method for determining $\sigma_i$ is presented in the t-SNE article and is related to the perplexity parameter.
We wish to obtain points $y_1, \dots, y_n$ belonging to a two-dimensional space, where each $y_i$ represents a point in the plane corresponding to $x_i$.

We define the similarity between points $y_i$ and $y_j$ in this space:

$$q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^2\right)}.$$

We let $q_{i|i} = 0$.
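As an illustration, both similarity matrices can be computed in a few lines of NumPy. This is only a sketch: the function names are ours, and the per-point bandwidths `sigma` are assumed given (e.g. by the perplexity search mentioned above).

```python
import numpy as np

def conditional_probabilities(X, sigma):
    """Row-stochastic matrix P with P[i, j] = p_{j|i}: similarity of x_j to
    x_i under a Gaussian centered at x_i with bandwidth sigma[i]."""
    # Squared pairwise distances ||x_i - x_j||^2
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-D / (2.0 * np.asarray(sigma)[:, None] ** 2))
    np.fill_diagonal(P, 0.0)          # convention: p_{i|i} = 0
    return P / P.sum(axis=1, keepdims=True)

def map_probabilities(Y):
    """Row-stochastic matrix Q with Q[i, j] = q_{j|i} (fixed variance 1/2)."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)          # convention: q_{i|i} = 0
    return Q / Q.sum(axis=1, keepdims=True)
```

Each row of the returned matrices sums to one, as expected for conditional distributions.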
What is a good map from the $x_i$ to the $y_i$?
A good map from the $x_i$ to the $y_i$ is a map for which the distributions $P_i := (p_{j|i})_j$ and $Q_i := (q_{j|i})_j$ are equal (for all $i$).
We measure the distance between the distributions $P_i$ and $Q_i$ with the Kullback-Leibler divergence:

$$KL(P_i \,\|\, Q_i) = \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}.$$
[See this post for more details].
Since we want to minimize these divergences for all $i$, we finally define the cost function as:

$$C = \sum_i KL(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}.$$
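With $P$ and $Q$ stored as row-stochastic matrices (rows indexed by $i$), the cost can be sketched as follows (the function name is ours; entries with $p_{j|i} = 0$ are skipped, since they contribute nothing to the sum):

```python
import numpy as np

def sne_cost(P, Q):
    """C = sum_i KL(P_i || Q_i) = sum_{i,j} p_{j|i} log(p_{j|i} / q_{j|i}).
    Entries with p_{j|i} = 0 contribute 0 by convention."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

The cost is zero exactly when every $Q_i$ matches $P_i$, and positive otherwise.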
We remember that the Kullback-Leibler divergence is asymmetric (see Fig. 3.6 of the Deep Learning book, p. 74). In SNE, we focus on choosing a distribution $Q_i$ which fairly represents the whole distribution $P_i$, so we optimize $KL(P_i \,\|\, Q_i)$:
- If we use widely separated map points to represent nearby datapoints, there is a large cost (i.e. we have a small $q_{j|i}$ to model a large $p_{j|i}$, i.e. the points are far apart in the map but close in the original space),
- If we use nearby map points to represent widely separated datapoints, there is only a small cost (i.e. we have a large $q_{j|i}$ to model a small $p_{j|i}$, i.e. the points are close in the map but far apart in the original space).
On the whole, “the SNE cost function focuses on retaining the local structure of the data in the map”.
The cost function is optimized with a variant of classic gradient descent (see the t-SNE article for details).
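The variant described in the t-SNE article adds a momentum term to the update (and Gaussian jitter on the map points in the early iterations, which we omit here). A minimal sketch of the momentum update, for a generic cost whose gradient function `grad` is assumed given:

```python
import numpy as np

def momentum_descent(grad, y0, lr=0.1, momentum=0.8, n_iter=100):
    """Gradient descent with momentum: each step moves along the current
    gradient plus a fraction of the previous displacement."""
    y = y0.copy()
    step = np.zeros_like(y)
    for _ in range(n_iter):
        step = -lr * grad(y) + momentum * step
        y = y + step
    return y

# Toy usage: minimize ||y||^2 (gradient 2y), whose minimizer is 0.
y = momentum_descent(lambda y: 2 * y, np.array([3.0, -2.0]))
```

The momentum term damps oscillations and speeds up progress along directions where the gradient is consistent across iterations.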
Derivation of the SNE gradient
We have, using the previous notations:

$$C = \sum_k \sum_{l \neq k} p_{l|k} \log \frac{p_{l|k}}{q_{l|k}}.$$

Taking the gradient, we want to compute:

$$\frac{\partial C}{\partial y_i} = \frac{\partial}{\partial y_i} \left( \sum_k \sum_{l \neq k} p_{l|k} \log \frac{p_{l|k}}{q_{l|k}} \right).$$

Note that the $p_{l|k}$ do not depend on the map points, so only the $-\log q_{l|k}$ parts contribute to the gradient.
Decomposition into 3 terms
We separate the calculation as a sum of 3 terms, writing $C = C_1 + C_2 + C_3$, so that:

$$\frac{\partial C}{\partial y_i} = \frac{\partial C_1}{\partial y_i} + \frac{\partial C_2}{\partial y_i} + \frac{\partial C_3}{\partial y_i},$$

where $C_1$ collects the terms with $k = i$; $C_2$ collects the terms with $l = i$; and $C_3$ is the remainder term:

$$C_1 = \sum_{l \neq i} p_{l|i} \log \frac{p_{l|i}}{q_{l|i}}, \qquad C_2 = \sum_{k \neq i} p_{i|k} \log \frac{p_{i|k}}{q_{i|k}}, \qquad C_3 = \sum_{k \neq i} \, \sum_{l \neq k,\, l \neq i} p_{l|k} \log \frac{p_{l|k}}{q_{l|k}}.$$

Beware that $C_3$ still depends on $y_i$, through the normalizing constants of the $q_{l|k}$.
Calculation of $\partial C_1 / \partial y_i$, $\partial C_2 / \partial y_i$ and $\partial C_3 / \partial y_i$
We define, for all $k$:

$$Z_k = \sum_{m \neq k} \exp\left(-\|y_k - y_m\|^2\right),$$

so that $q_{l|k} = \exp\left(-\|y_k - y_l\|^2\right) / Z_k$ and $\log q_{l|k} = -\|y_k - y_l\|^2 - \log Z_k$.

We have, using the fact that the $p_{l|i}$ do not depend on $y_i$:

$$\frac{\partial C_1}{\partial y_i} = -\sum_{l \neq i} p_{l|i} \frac{\partial \log q_{l|i}}{\partial y_i} = \sum_{l \neq i} p_{l|i} \left( \frac{\partial \|y_i - y_l\|^2}{\partial y_i} + \frac{1}{Z_i} \frac{\partial Z_i}{\partial y_i} \right).$$

Since $\frac{\partial \|y_i - y_l\|^2}{\partial y_i} = 2\,(y_i - y_l)$ and $\frac{\partial Z_i}{\partial y_i} = -2 \sum_{m \neq i} \exp\left(-\|y_i - y_m\|^2\right)(y_i - y_m)$, then:

$$\frac{\partial C_1}{\partial y_i} = 2 \sum_{l \neq i} p_{l|i}\,(y_i - y_l) - 2 \left( \sum_{l \neq i} p_{l|i} \right) \sum_{m \neq i} q_{m|i}\,(y_i - y_m).$$

Since $\sum_{l \neq i} p_{l|i} = 1$, we have:

$$\frac{\partial C_1}{\partial y_i} = 2 \sum_{l \neq i} \left( p_{l|i} - q_{l|i} \right)(y_i - y_l).$$

We now turn to $C_2$. For $k \neq i$, $\log q_{i|k} = -\|y_k - y_i\|^2 - \log Z_k$, so:

$$\frac{\partial C_2}{\partial y_i} = \sum_{k \neq i} p_{i|k} \left( \frac{\partial \|y_k - y_i\|^2}{\partial y_i} + \frac{1}{Z_k} \frac{\partial Z_k}{\partial y_i} \right).$$

But only the $m = i$ term of $Z_k$ depends on $y_i$, so $\frac{\partial Z_k}{\partial y_i} = -2 \exp\left(-\|y_k - y_i\|^2\right)(y_i - y_k)$, hence $\frac{1}{Z_k} \frac{\partial Z_k}{\partial y_i} = -2\, q_{i|k}\,(y_i - y_k)$ and:

$$\frac{\partial C_2}{\partial y_i} = 2 \sum_{k \neq i} p_{i|k} \left( 1 - q_{i|k} \right)(y_i - y_k).$$

For the remainder term, $q_{l|k}$ (with $k \neq i$ and $l \neq i$) depends on $y_i$ only through $Z_k$, so:

$$\frac{\partial C_3}{\partial y_i} = \sum_{k \neq i} \left( \sum_{l \neq k,\, l \neq i} p_{l|k} \right) \frac{1}{Z_k} \frac{\partial Z_k}{\partial y_i} = -2 \sum_{k \neq i} \left( \sum_{l \neq k,\, l \neq i} p_{l|k} \right) q_{i|k}\,(y_i - y_k).$$

We have $\sum_{l \neq k,\, l \neq i} p_{l|k} = 1 - p_{i|k}$, so finally:

$$\frac{\partial C_2}{\partial y_i} + \frac{\partial C_3}{\partial y_i} = 2 \sum_{k \neq i} \left[ p_{i|k} \left( 1 - q_{i|k} \right) - \left( 1 - p_{i|k} \right) q_{i|k} \right](y_i - y_k) = 2 \sum_{k \neq i} \left( p_{i|k} - q_{i|k} \right)(y_i - y_k).$$
Final form of the gradient
So after simplification (renaming the summation index to $j$):

$$\frac{\partial C}{\partial y_i} = 2 \sum_{j} \left( p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j} \right)(y_i - y_j).$$
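As a sanity check, we can compare this closed form against a finite-difference approximation of the cost (a self-contained sketch; the function and variable names are ours):

```python
import numpy as np

def q_cond(Y):
    """q_{j|i} in the low-dimensional map."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)

def cost(P, Y):
    """C = sum_{i,j} p_{j|i} log(p_{j|i} / q_{j|i})."""
    Q = q_cond(Y)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def grad(P, Y):
    """dC/dy_i = 2 sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)."""
    Q = q_cond(Y)
    M = P - Q + P.T - Q.T                       # coefficient of (y_i - y_j)
    return 2.0 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)

# Numerical check on random data
rng = np.random.default_rng(0)
n = 6
P = rng.random((n, n)); np.fill_diagonal(P, 0.0)
P /= P.sum(axis=1, keepdims=True)               # a valid set of p_{j|i}
Y = rng.normal(size=(n, 2))

G = grad(P, Y)
eps = 1e-6
G_num = np.zeros_like(Y)
for i in range(n):
    for d in range(2):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[i, d] += eps; Ym[i, d] -= eps
        G_num[i, d] = (cost(P, Yp) - cost(P, Ym)) / (2 * eps)
assert np.allclose(G, G_num, atol=1e-5)
```

The analytic gradient agrees with central differences to within numerical precision, which is strong evidence that no sign or normalization term was lost along the way.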
The form of this gradient is interpreted in Section 2 of the t-SNE article.