Authors
Thaer Samar, Hugo C Huurdeman, Anat Ben-David, Jaap Kamps, Arjen De Vries
Publication date
2014/7/3
Book
Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval
Pages
1199-1202
Description
Many national and international heritage institutes realize the importance of archiving the web for future cultural heritage. Web archiving is currently performed either by harvesting a national domain or by crawling a pre-defined list of websites selected by the archiving institution. With either method, crawling harvests more information than just the websites intended for preservation; this extra information could be used to reconstruct impressions of pages that existed on the live web at the crawl date but would otherwise have been lost forever. We present a method to create representations of what we will refer to as a web collection's "aura": the web documents that were not included in the archived collection but are known to have existed, because they are mentioned on pages that were included in the archived web collection. To create representations of these unarchived pages, we exploit the information about the unarchived …
Total citations
[Chart: citations per year, 2014 to 2020]
Scholar articles
T Samar, HC Huurdeman, A Ben-David, J Kamps… - Proceedings of the 37th international ACM SIGIR …, 2014