chainer.functions.negative_sampling(x, t, W, sampler, sample_size, reduce='sum')

Negative sampling loss function.

In natural language processing, especially language modeling, the vocabulary can be very large, so calculating the gradient of the whole embedding matrix takes a lot of time.

By using the negative sampling trick, you only need to calculate the gradient for a few sampled negative examples.

The objective function is below:

\[f(x, p) = \log \sigma(x^\top w_p) + k E_{i \sim P(i)}[\log \sigma(-x^\top w_i)],\]

where \(\sigma(\cdot)\) is the sigmoid function, \(w_i\) is the weight vector for the word \(i\), and \(p\) is a positive example. The expectation is approximated using a set \(N\) of \(k\) examples sampled from the distribution \(P(i)\):

\[f(x, p) \approx \log \sigma(x^\top w_p) + \sum_{n \in N} \log \sigma(-x^\top w_n).\]
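
As a rough illustration of the approximation above, the following pure-Python sketch evaluates \(f(x, p)\) for a single input vector. The helper names (log_sigmoid, dot, approx_objective) and the toy vectors are assumptions made for this example; they are not part of Chainer, and the sketch does not reproduce Chainer's implementation.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    # log sigmoid(z) = -log(1 + exp(-z)); branch so exp never overflows.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def approx_objective(x, W, p, negatives):
    """log sigmoid(x.w_p) + sum over n in N of log sigmoid(-x.w_n)."""
    positive_term = log_sigmoid(dot(x, W[p]))
    negative_term = sum(log_sigmoid(-dot(x, W[n])) for n in negatives)
    return positive_term + negative_term


# Toy data (hypothetical): 4-word vocabulary, 2-dimensional vectors.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
x = [2.0, 0.0]

# Word 0 is the positive example; words 2 and 3 play the sampled negatives.
f_value = approx_objective(x, W, p=0, negatives=[2, 3])
```

Every term is the logarithm of a sigmoid output, so the value is never positive and gets closer to zero as the positive score rises and the negative scores fall.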

Each sample of \(N\) is drawn from the word distribution \(P(w)\). This is calculated as \(P(w) = \frac{1}{Z} c(w)^\alpha\), where \(c(w)\) is the unigram count of the word \(w\), \(\alpha\) is a hyper-parameter, and \(Z\) is the normalization constant.
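
The distribution above can be sketched with the Python standard library alone. The counts, the value alpha=0.75, and the function names below are assumptions chosen for illustration, and random.choices stands in for whatever sampler is actually passed to the function.

```python
import random


def unigram_power_distribution(counts, alpha):
    """Return P(w) = c(w)**alpha / Z for every word w."""
    weights = [c ** alpha for c in counts]
    z = sum(weights)  # normalization constant Z
    return [w / z for w in weights]


def draw_negatives(probs, k, rng):
    """Draw k word indices independently from the distribution probs."""
    return rng.choices(range(len(probs)), weights=probs, k=k)


counts = [100, 10, 1]  # hypothetical unigram counts c(w)
probs = unigram_power_distribution(counts, alpha=0.75)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
negatives = draw_negatives(probs, k=5, rng=rng)
```

With \(\alpha < 1\) the distribution is flatter than the raw frequencies, so rare words are sampled somewhat more often than their counts alone would suggest.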

Parameters:

  • x (Variable) – Batch of input vectors.
  • t (Variable) – Vector of ground truth labels.
  • W (Variable) – Weight matrix.
  • sampler (FunctionType) – Sampling function. It takes a shape and returns an integer array of the shape. Each element of this array is a sample from the word distribution. A WalkerAlias object built with the power distribution of word frequency is recommended.
  • sample_size (int) – Number of samples.
  • reduce (str) – Reduction option. Its value must be either 'sum' or 'no'. Otherwise, ValueError is raised.

Returns: A variable holding the loss value(s) calculated by the above equation. If reduce is 'no', the output variable holds an array whose shape is the same as one of (hence both of) the input variables. If it is 'sum', the output variable holds a scalar value.
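
The effect of the reduce option can be sketched as follows, assuming one loss value per example in the batch. The function reduce_losses and the numbers are hypothetical and only mimic the documented behaviour; they are not Chainer code.

```python
def reduce_losses(per_example_losses, reduce='sum'):
    """Apply a 'sum' or 'no' reduction to a list of per-example losses."""
    if reduce not in ('sum', 'no'):
        raise ValueError("reduce must be 'sum' or 'no', got %r" % (reduce,))
    if reduce == 'sum':
        # A single scalar for the whole batch.
        return sum(per_example_losses)
    # 'no': keep one value per example.
    return list(per_example_losses)


batch_losses = [0.5, 1.25, 0.25]  # hypothetical per-example losses

total = reduce_losses(batch_losses, reduce='sum')
unreduced = reduce_losses(batch_losses, reduce='no')
```

Keeping the values unreduced is useful when each example needs its own weight or mask before the final sum.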

Return type: Variable


See: Distributed Representations of Words and Phrases and their Compositionality

See also