Uncovering the unarchived web

Authors

Thaer Samar, Hugo C Huurdeman, Anat Ben-David, Jaap Kamps, Arjen De Vries

Publication date

2014/7/3

Book

Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval

Pages

1199-1202

Description

Many national and international heritage institutes realize the importance of archiving the web for future cultural heritage. Web archiving is currently performed either by harvesting a national domain or by crawling a pre-defined list of websites selected by the archiving institution. With either method, the crawl harvests more information than just the websites intended for preservation. This extra information could be used to reconstruct impressions of pages that existed on the live web at the crawl date but would otherwise have been lost forever. We present a method to create representations of what we will refer to as a web collection's "aura": the web documents that were not included in the archived collection but are known to have existed, because they are mentioned on pages that were included in the archived web collection. To create representations of these unarchived pages, we exploit the information about the unarchived …
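The idea in the description can be illustrated with a rough sketch: collect every link target that is mentioned on an archived page but is not itself archived, and gather the evidence the archive holds about it. The description is truncated, so the choice of evidence here (source pages and anchor text) is an assumption, not necessarily the authors' method. The function name `build_aura`, the input layout, and the example URLs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def build_aura(archived_pages):
    """Collect evidence about pages that are linked to but not archived.

    archived_pages (hypothetical layout): dict mapping an archived URL to a
    list of (target_url, anchor_text) pairs extracted from that page.
    Returns a dict mapping each unarchived target URL to the archived pages
    that mention it and the anchor texts used for it.
    """
    archived = set(archived_pages)
    aura = defaultdict(lambda: {"sources": set(), "anchor_texts": []})

    for source_url, links in archived_pages.items():
        for target_url, anchor_text in links:
            # Drop the #fragment so variants of one page are merged.
            target, _fragment = urldefrag(target_url)
            if target in archived:
                continue  # already preserved, so not part of the aura
            entry = aura[target]
            entry["sources"].add(source_url)
            text = anchor_text.strip()
            if text:
                entry["anchor_texts"].append(text)

    return dict(aura)


# Toy archive of two pages (made-up data for illustration only).
pages = {
    "http://example.org/": [
        ("http://example.org/about", "About us"),
        ("http://example.net/news#top", "Latest news"),
    ],
    "http://example.org/about": [
        ("http://example.net/news", "news site"),
        ("http://example.org/", "Home"),
    ],
}

aura = build_aura(pages)
for url, evidence in aura.items():
    print(url, sorted(evidence["sources"]), evidence["anchor_texts"])
```

In this toy archive only `http://example.net/news` ends up in the aura, since the other link targets are themselves archived. A fuller treatment would also need URL normalization beyond fragment removal, which this sketch leaves out.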

Total citations

Cited by 12

[Chart: citations per year, 2014 to 2020; bar values as extracted: 3, 2, 2, 3, 1]
