OXPath: A Language for Scalable Data Extraction, Automation, and Crawling on the Deep Web

Authors

Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Sellers

Publication date

2013/2

Journal

The VLDB Journal

Volume

Issue

1 - Special issue on best papers of VLDB 2011

Pages

47-72

Publisher

Springer Berlin / Heidelberg

Description

The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the …

Total citations

Cited by 125

Citations per year: 2012: 1, 2013: 12, 2014: 13, 2015: 17, 2016: 19, 2017: 7, 2018: 17, 2019: 12, 2020: 8, 2021: 5, 2022: 3, 2023: 8, 2024: 1
