Authors
Andrei Lavinia, Ciprian Dobre, Florin Pop, Valentin Cristea
Publication date
2011/7/1
Journal
International Journal of Distributed Systems and Technologies (IJDST)
Volume
2
Issue
3
Pages
64-87
Publisher
IGI Global
Description
Failure detection is a fundamental building block for ensuring fault tolerance in large-scale distributed systems. It is also a difficult problem. A resource under heavy load can be mistaken for a failed one. A failed network link can be detected by the lack of a response, but the same symptom appears when a computational resource fails. Although progress has been made, no existing approach provides a system that covers all essential aspects of a distributed environment. This paper presents a failure detection system based on adaptive, decentralized failure detectors. The system is developed as an independent substrate that works asynchronously and independently of the application flow. It uses a hierarchical protocol, creating a clustering mechanism that ensures dynamic configuration and traffic optimization. It also uses a gossip strategy for failure detection at local levels to minimize detection time and remove …
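The local-level gossip strategy mentioned in the description can be illustrated with a generic heartbeat-table sketch. This is not the paper's algorithm: the class `GossipNode`, the fixed timeout `t_fail`, and the method names are hypothetical, and an adaptive detector such as the one the paper describes would adjust the timeout from observed delays rather than keep it constant.

```python
from dataclasses import dataclass, field


@dataclass
class GossipNode:
    """One member of a local cluster (hypothetical illustration)."""
    name: str
    t_fail: float  # assumed fixed timeout, in seconds, before a peer is suspected
    # peer name -> (highest heartbeat counter seen, local time it last increased)
    table: dict = field(default_factory=dict)
    heartbeat: int = 0

    def tick(self, now: float) -> None:
        """Advance this node's own heartbeat counter."""
        self.heartbeat += 1
        self.table[self.name] = (self.heartbeat, now)

    def digest(self) -> dict:
        """Return the counters to send to a gossip partner."""
        return {peer: hb for peer, (hb, _) in self.table.items()}

    def merge(self, remote: dict, now: float) -> None:
        """Keep the larger counter per peer; refresh the timestamp only on progress."""
        for peer, hb in remote.items():
            known_hb, _ = self.table.get(peer, (-1, now))
            if hb > known_hb:
                self.table[peer] = (hb, now)

    def suspected(self, now: float) -> set:
        """Peers whose counter has not advanced within t_fail seconds."""
        return {
            peer
            for peer, (_, seen) in self.table.items()
            if peer != self.name and now - seen > self.t_fail
        }
```

In this sketch a node calls `tick` periodically, sends `digest()` to a randomly chosen peer, and the receiver calls `merge`. A peer is suspected only when no node has relayed a newer counter for it, so a single lost message or one slow link does not by itself trigger a suspicion.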
Total citations
[Chart: citations per year, 2011-2024]