Authors
Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski
Publication date
2016/7/1
Journal
Transactions of the Association for Computational Linguistics
Volume
4
Pages
385-399
Publisher
MIT Press
Description
Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods. Many use nonlinear operations on co-occurrence statistics, and have hand-tuned hyperparameters and reweighting methods.
This paper proposes a new generative model, a dynamic version of the log-linear topic model of Mnih and Hinton (2007). The methodological novelty is to use the prior to compute closed-form expressions for word statistics. This provides a theoretical justification for nonlinear models like PMI, word2vec, and GloVe, as well as some hyperparameter choices. It also helps explain why low-dimensional semantic embeddings contain linear algebraic structure that allows solution of word analogies, as shown by Mikolov et al. (2013a) and many subsequent papers …
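The closed-form expressions referred to above can be stated concretely. Paraphrasing the paper's main theorem (under its assumptions of a slowly drifting latent discourse vector and roughly isotropic word vectors), for embeddings $v_w \in \mathbb{R}^d$ and a partition function $Z$ that concentrates around a constant:

$$\log p(w, w') = \frac{\lVert v_w + v_{w'} \rVert^2}{2d} - 2\log Z \pm \epsilon, \qquad \log p(w) = \frac{\lVert v_w \rVert^2}{2d} - \log Z \pm \epsilon.$$

Subtracting gives the pointwise mutual information, $\mathrm{PMI}(w, w') = \log \frac{p(w, w')}{p(w)\,p(w')} = \frac{\langle v_w, v_{w'} \rangle}{d} \pm O(\epsilon)$, which is the closed-form connection to PMI-based embedding methods mentioned above.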
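The linear algebraic structure mentioned above is what makes vector-arithmetic analogy solving work (e.g. man : woman :: king : queen). A minimal sketch in Python, assuming a dict named embeddings that maps words to NumPy vectors (the names here are illustrative, not from the paper):

    import numpy as np

    def solve_analogy(embeddings, a, b, c):
        """Answer 'a is to b as c is to ?' by finding the word whose
        vector has the highest cosine similarity to v_b - v_a + v_c."""
        target = embeddings[b] - embeddings[a] + embeddings[c]
        target = target / np.linalg.norm(target)
        best_word, best_score = None, -np.inf
        for word, vec in embeddings.items():
            if word in (a, b, c):  # conventionally exclude the query words
                continue
            score = float(vec @ target) / np.linalg.norm(vec)
            if score > best_score:
                best_word, best_score = word, score
        return best_word

For example, solve_analogy(embeddings, "man", "king", "woman") should return "queen" when the embeddings exhibit the structure the paper explains.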
Total citations
[Citations-per-year chart, 2015–2024]