Lost but not forgotten: finding pages on the unarchived web

Authors

Hugo C Huurdeman, Jaap Kamps, Thaer Samar, Arjen P de Vries, Anat Ben-David, Richard A Rogers

Publication date

2015/9

Journal

International Journal on Digital Libraries

Pages

247-265

Publisher

Springer Berlin Heidelberg

Description

Web archives attempt to preserve the fast-changing web, yet they will always be incomplete. Due to limits on crawling depth and crawling frequency, and to restrictive selection policies, large parts of the web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page-level and host-level representations of unarchived content. Our main findings are the following: First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, which could dramatically increase the coverage of a web archive. Second, the link and anchor text have a highly skewed distribution: popular pages …

Total citations

Cited by 26

Citations per year: 2016: 5, 2017: 4, 2018: 7, 2019: 4, 2020: 1, 2021: 1, 2022: 2, 2023: 1
