Acting on this observation, e.g. by skipping the normalization term of the softmax, we get a significant improvement in our NN training, and at no other cost than a few minutes of coding. The only drawback is the introduction of some new hyper-parameters, α, β, and the target values. However, these have been easy to choose, and we do not expect that a lot of tedious fine-tuning is required in the general case.
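As a minimal sketch of what a normalization-free output loss could look like (illustrative only: the exact roles of α, β, and the target values follow the scheme discussed above, and the function name and defaults below are our own choices, not code from this page):

<code python>
import numpy as np

# Illustrative sketch, not this page's implementation: push the logit of the
# correct word toward a fixed target t_pos and the other logits toward t_neg,
# weighting the two terms with alpha and beta. The softmax normalization term
# (the partition function) is never computed.
def unnormalized_output_loss(logits, target_idx,
                             alpha=1.0, beta=0.1, t_pos=1.0, t_neg=-1.0):
    pos_term = (logits[target_idx] - t_pos) ** 2
    mask = np.ones(logits.shape, dtype=bool)
    mask[target_idx] = False
    neg_term = np.mean((logits[mask] - t_neg) ** 2)
    return alpha * pos_term + beta * neg_term

# Example: 5-word vocabulary, correct word at index 2.
print(unnormalized_output_loss(np.array([0.3, -0.5, 0.9, 0.1, -1.2]), 2))
</code>

In a sketch like this nothing has to be normalized over the whole vocabulary, and the negative term could also be subsampled, which is the usual motivation for dropping the normalization term.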

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
https://arxiv.org/pdf/1711.03953.pdf

We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language.
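To make the Softmax bottleneck concrete, here is a brief sketch of the rank argument (the notation is ours, chosen for illustration, and is not quoted from the paper):

<code latex>
% Stack all contexts c_i as rows and all vocabulary words x_j as columns.
A_{ij} = \log P^{*}(x_j \mid c_i)
% True next-token log-probability matrix; its rank can be as large as the
% vocabulary size M.
% A Softmax model with d-dimensional context vectors (rows of H) and word
% embeddings (rows of W) can only realize, up to a per-row constant shift,
\hat{A} = H W^{\top}, \qquad \operatorname{rank}(H W^{\top}) \le d
% so if the true A has rank much larger than d, no choice of embeddings can
% match the true distribution for every context.
</code>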

Specifically, we introduce discrete latent variables into a recurrent language model, and formulate the next-token probability distribution as a Mixture of Softmaxes (MoS). Mixture of Softmaxes is more expressive than Softmax and other surrogates considered in prior work. Moreover, we show that MoS learns matrices that have much larger normalized singular values and thus much higher rank than Softmax and other baselines on real-world datasets.
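As an illustration of the Mixture of Softmaxes output layer (a minimal numpy sketch; the shapes, the number of components K, and all parameter names are our own assumptions, not code from the paper):

<code python>
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mos_next_token_probs(h, W_prior, W_latent, W_embed):
    """Mixture of Softmaxes over the vocabulary (illustrative sketch).

    h        : (d,)      context vector from the RNN
    W_prior  : (K, d)    maps the context to K mixture weights (the discrete latent)
    W_latent : (K, d, d) per-component projections of the context
    W_embed  : (V, d)    word embeddings shared by all components
    """
    pi = softmax(W_prior @ h)                            # (K,)  mixture weights pi_k(c)
    h_k = np.tanh(np.einsum('kij,j->ki', W_latent, h))   # (K, d) per-component contexts
    comp = softmax(h_k @ W_embed.T, axis=-1)             # (K, V) one softmax per component
    # Mixing happens in probability space, so the resulting log-probability
    # matrix is no longer constrained to rank d; this is how MoS escapes the
    # Softmax bottleneck.
    return pi @ comp                                     # (V,)

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
d, V, K = 8, 20, 3
p = mos_next_token_probs(rng.normal(size=d),
                         rng.normal(size=(K, d)),
                         rng.normal(size=(K, d, d)),
                         rng.normal(size=(V, d)))
print(p.sum())  # sums to 1.0
</code>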