Authors
Fangbo Tao, Honglei Zhuang, Chi Wang Yu, Qi Wang, Taylor Cassidy, Lance M Kaplan, Clare R Voss, Jiawei Han
Publication date
2016/9
Journal
IEEE Data Eng. Bull.
Volume
39
Issue
3
Pages
74-84
Description
To systematically analyze large numbers of textual documents, it is often desirable to manage documents (and their metadata) in a multi-dimensional text database (Text Cube). Such structure provides flexibility of understanding local information with different granularities. Moreover, the contextualized analysis derived from cube structure often yields comparative insights. To quickly digest the content of subsets of documents in the multi-dimensional context, we study the problem of phrase-based summarization of a subset of documents of interest. We propose a new phrase ranking measure to leverage the relation between document subsets induced by multi-dimensional context and identify phrases that truly distinguish the queried subset of documents from neighboring subsets (ie, background). Our quality evaluation suggests the new measure involving dynamic, query-dependent background generation is more effective than previous measures using the whole corpus as a static background for finding representative phrases. Computing this measure is more expensive due to the need of access to many subsets of documents to answer one query. We develop a cube-based analytical platform that implements an efficient solution by materializing a deliberately selected part of statistics, and using these statistics to perform online query processing within a constant latency constraint. Our experiments in a large news dataset demonstrate the efficiency in both query processing time and storage cost.
Total citations
20152016201720182019202020212022202313786142
Scholar articles
F Tao, H Zhuang, CW Yu, Q Wang, T Cassidy… - IEEE Data Eng. Bull., 2016