OXPath: A language for scalable, memory-efficient data extraction from web applications

Authors

Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Sellers

Publication date

2011

Journal

Proceedings of the VLDB Endowment

Pages

1016-1027

Description

The evolution of the web has outpaced itself: The growing wealth of information and the increasing sophistication of interfaces necessitate automated processing. Web automation and extraction technologies have been overwhelmed by this very growth.

To address this trend, we identify four key requirements of web extraction: (1) Interact with sophisticated web application interfaces, (2) Precisely capture the relevant data for most web extraction tasks, (3) Scale with the number of visited pages, and (4) Readily embed into existing web technologies.

We introduce OXPath, an extension of XPath for interacting with web applications and for extracting information thus revealed. It addresses all the above requirements. OXPath's page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We validate experimentally the theoretical complexity and …
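The page-at-a-time idea can be illustrated with a minimal Python sketch. It is not OXPath's implementation or syntax. It only shows, under simplifying assumptions, how an extractor can follow "next" links while holding a single page in memory and streaming records out as it goes. The `PAGES` data, the `fetch` helper, and the `extract_page_at_a_time` function are all hypothetical names introduced for illustration.

```python
from typing import Callable, Iterator, Optional

# Hypothetical in-memory "site": each page has records and a link to the next page.
PAGES = {
    "p1": {"records": ["a", "b"], "next": "p2"},
    "p2": {"records": ["c"], "next": "p3"},
    "p3": {"records": ["d", "e"], "next": None},
}


def fetch(page_id: str) -> dict:
    """Stand-in for loading a page in a browser (hypothetical helper)."""
    return PAGES[page_id]


def extract_page_at_a_time(
    start: str, fetch_page: Callable[[str], dict]
) -> Iterator[str]:
    """Yield records while keeping a reference to only the current page."""
    current: Optional[str] = start
    while current is not None:
        page = fetch_page(current)    # the only page referenced at this point
        yield from page["records"]    # stream results out immediately
        current = page["next"]        # follow the link; the old page is rebound next loop


# Collecting into a list here is only for demonstration; a real consumer
# could iterate lazily and write each record out as it arrives.
records = list(extract_page_at_a_time("p1", fetch))
```

Because the generator never accumulates visited pages, the extractor's own working set stays the same size whether the crawl covers three pages or three thousand; only the consumer decides how much output to retain.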

Total citations

Cited by 43

Citations per year: 2011: 3, 2012: 12, 2013: 7, 2014: 6, 2015: 3, 2016: 2, 2017: 3, 2018: 2, 2019: 1, 2020: 3
