Finding pages on the unarchived web

Authors

Hugo C Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar, Arjen P De Vries

Publication date

2014/9/8

Conference

IEEE/ACM Joint Conference on Digital Libraries

Pages

331-340

Publisher

IEEE

Description

Web archives preserve the fast-changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies; most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment with this approach on the Dutch Web archive. Our main findings are threefold. First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive. Second, the link and anchor descriptions have a highly skewed distribution: popular pages such as home pages have more terms, but the richness tapers off quickly. Third, the succinct representation is generally rich enough to uniquely …

Total citations

Cited by 15

Citations per year: 2014: 1, 2015: 3, 2016: 2, 2017: 2, 2018: 1, 2019: 5, 2020: 1
