Authors
Thaer Samar, Myriam C Traub, Jacco van Ossenbruggen, Arjen P de Vries
Publication date
2016
Conference
Research and Advanced Technology for Digital Libraries: 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Hannover, Germany, September 5–9, 2016, Proceedings 20
Pages
133-146
Publisher
Springer International Publishing
Description
Web archives preserve the fast-changing Web by repeatedly crawling its content. The crawling strategy influences which data is archived. We use the link anchor text of two Web crawls created with different crawling strategies to compare their coverage of past popular topics. One of our crawls was collected by the National Library of the Netherlands (KB) using a depth-first strategy on manually selected websites from the .nl domain, with the goal of crawling websites as completely as possible. The second crawl was collected by the Common Crawl foundation using a breadth-first strategy on the entire Web; this strategy focuses on discovering as many links as possible. The two crawls differ in their scope of coverage: while the KB dataset covers mainly the Dutch domain, the Common Crawl dataset covers websites from the entire Web. Therefore, we used three different sources to identify topics …
Scholar articles
T Samar, MC Traub, J van Ossenbruggen, AP de Vries - Research and Advanced Technology for Digital …, 2016