From sBoW to dCoT marginalized encoders for text representation

Authors

Zhixiang Xu, Minmin Chen, Kilian Q Weinberger, Fei Sha

Publication date

2012/10/29

Book

Proceedings of the 21st ACM international conference on Information and knowledge management

Pages

1879-1884

Description

In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag of Words (sBoW) vectors (e.g. TF-IDF [1]). Although simple and intuitive, sBoW-style representations suffer from their inherent over-sparsity and fail to capture word-level synonymy and polysemy. Especially when labeled data is limited (e.g. in document classification), or the text documents are short (e.g. emails or abstracts), many features are rarely observed within the training corpus. This leads to overfitting and reduced generalization accuracy. In this paper we propose Dense Cohort of Terms (dCoT), an unsupervised algorithm to learn improved sBoW document features. dCoT explicitly models absent words by removing and reconstructing random subsets of words in the unlabeled corpus. With this approach, dCoT learns to reconstruct frequent words from co-occurring infrequent …
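The remove-and-reconstruct idea in the description can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a plain linear reconstruction fitted by ridge regression, and it samples corrupted copies explicitly, whereas the title suggests the paper marginalizes over the corruption. The function name `fit_denoising_map`, its parameters `p`, `n_copies`, and `ridge`, and the toy corpus are all hypothetical.

```python
import numpy as np


def fit_denoising_map(X, p=0.5, n_copies=20, ridge=1e-3, seed=0):
    """Fit a linear map W that reconstructs X from word-removed copies of X.

    X        : (n_docs, n_words) bag-of-words matrix.
    p        : probability of removing (zeroing) each entry; illustrative value.
    n_copies : number of corrupted copies of the corpus to sample.
    ridge    : small L2 penalty that keeps the linear system well conditioned.
    """
    rng = np.random.default_rng(seed)
    n_words = X.shape[1]
    # Each copy keeps an entry with probability 1 - p and zeroes it otherwise.
    corrupted = np.vstack(
        [X * (rng.random(X.shape) >= p) for _ in range(n_copies)]
    )
    # Every corrupted copy is paired with the original, uncorrupted corpus.
    targets = np.vstack([X] * n_copies)
    # Ridge regression: minimize ||targets - corrupted @ W||^2 + ridge * ||W||^2.
    A = corrupted.T @ corrupted + ridge * np.eye(n_words)
    B = corrupted.T @ targets
    return np.linalg.solve(A, B)


# Hypothetical toy corpus: words 0 and 1 always co-occur, as do words 2 and 3.
X = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ],
    dtype=float,
)

W = fit_denoising_map(X)

# A short document that contains only word 0.
query = np.array([1.0, 0.0, 0.0, 0.0])
dense_query = query @ W
print(dense_query)
```

In this toy setting the mapped vector gives positive weight to word 1, which never appears in the query but co-occurs with word 0 in the corpus, while words 2 and 3 stay at zero. That is the sense in which a reconstruction map can turn a sparse vector into a denser one.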

Total citations

Cited by 28

[Chart: citations per year, 2013-2022]
