Authors
Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Sellers
Publication date
2013/2
Journal
The VLDB Journal
Volume
22
Issue
1 - Special issue on best papers of VLDB 2011
Pages
47-72
Publisher
Springer Berlin / Heidelberg
Description
The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the …
Total citations
[Bar chart: citations per year, 2012-2024]