Turn the Page: Automated Traversal of Paginated Websites

Authors

Tim Furche, Giovanni Grasso, Andrey Kravchenko, Christian Schallhart

Publication date

2012

Conference

International Conference on Web Engineering (ICWE 2012)

Description

Content-intensive web sites, such as Google or Amazon, paginate their results to accommodate limited screen sizes. Thus, human users and automatic tools alike have to traverse the pagination links when they crawl the site, extract data, or automate common tasks, where these applications require access to the entire result set. Previous approaches, as well as existing crawlers and automation tools, rely on simple heuristics (e.g., considering only the link text), falling back to an exhaustive exploration of the site where those heuristics fail. In particular, focused crawlers and data extraction systems target only fractions of the individual pages of a given site, rendering a highly accurate identification of pagination links essential to avoid the exhaustive exploration of irrelevant pages.
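To make the baseline concrete, the following is a minimal sketch of the kind of link-text heuristic the description attributes to earlier approaches. It is not the authors' method. The keyword pattern `NEXT_TEXT` and the helper names `LinkCollector` and `guess_next_links` are assumptions introduced purely for illustration.

```python
import re
from html.parser import HTMLParser

# Assumed keyword pattern for "next page" link text; real sites vary widely.
NEXT_TEXT = re.compile(
    r"^\s*(next|more|older|>>|>|›|»)\s*(page)?\s*$", re.IGNORECASE
)


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def guess_next_links(html):
    """Return hrefs whose link text alone looks like a 'next page' label."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [href for href, text in parser.links if NEXT_TEXT.match(text)]
```

On a result page whose pagination uses only numbered links or image buttons, this function returns an empty list, which is the situation in which such tools fall back to exhaustive exploration.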

We identify pagination links in a wide range of domains and sites with near-perfect accuracy (99%). We obtain these results with a …

Total citations

Cited by 14

[Chart: citations per year, 2012 to 2020]