Crawling deep web entity pages

Authors

Yeye He, Dong Xin, Venkatesh Ganti, Sriram Rajaraman, Nirav Shah

Publication date

2013/2/4

Book

Proceedings of the sixth ACM international conference on Web search and data mining

Pages

355-364

Description

Deep-web crawling is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document-oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems including query generation, empty-page filtering, and URL deduplication in the specific context …

Total citations

Cited by 113

Citations per year: 2013: 4, 2014: 4, 2015: 17, 2016: 25, 2017: 17, 2018: 6, 2019: 9, 2020: 11, 2021: 8, 2022: 4, 2023: 6, 2024: 1
