Authors
Paolo Boldi, Andrea Marino, Massimo Santini, Sebastiano Vigna
Publication date
2018/6/1
Journal
ACM Transactions on the Web (TWEB)
Volume
12
Issue
2
Pages
1-26
Publisher
ACM
Description
Although web crawlers have been around for twenty years by now, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine systems, and, at the same time, scales linearly with the amount of resources available. This article aims to fill this gap by describing BUbiNG, our next-generation web crawler built upon the authors’ experience with UbiCrawler [9] and on the last ten years of research on the topic. BUbiNG is an open-source, fully distributed crawler written in Java; a single BUbiNG agent, using sizeable hardware, can crawl several thousand pages per second while respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG's job distribution is based on modern high-speed protocols to achieve very high throughput.
Total citations
Citations-per-year chart, 2011-2024 (per-year counts not recoverable from the extracted text)
Scholar articles
P Boldi, A Marino, M Santini, S Vigna - ACM Transactions on the Web (TWEB), 2018