Computing technology has become pervasive, and with it the expectation that it be readily available whenever needed, which is to say essentially at all times. Dependability is the set of techniques for building, configuring, operating, and managing computer systems to ensure that they are reliable, available, safe, and secure.1 But alas, faults seem to be inherent to computer systems. Components can simply crash, produce incorrect output due to hardware or software bugs, or be invaded by impostors that orchestrate their behavior. Fault tolerance is the ability of a system as a whole to continue operating correctly and with acceptable performance even when some of its components are faulty.3
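A classic example of masking component faults is triple modular redundancy: run the same computation on three replicas and take a majority vote over their outputs, so that a single faulty replica, whether it crashes or returns an incorrect value, cannot corrupt the result. The Python sketch below is a minimal illustration of the idea; the function names and replica setup are hypothetical, not taken from any particular system.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value a majority of replicas agree on, or None if no majority exists."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

def run_with_tmr(replicas, x):
    """Run the same input through all replicas and mask a minority of faulty results."""
    return majority_vote([replica(x) for replica in replicas])

# One replica produces incorrect output (say, due to a hardware bug),
# but the majority vote still yields the correct result.
correct = lambda x: x * 2
faulty = lambda x: x * 2 + 1
print(run_with_tmr([correct, correct, faulty], 21))  # prints 42
```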
Fault tolerance is not new; von Neumann himself designed techniques for computers to survive faults.4 The premier conference in the area, the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), held its 50th edition in 2020. In Latin America, the first edition of the Brazilian Symposium on Fault-Tolerant Computers (SCTF) was held in 1985, and the event took place for 18 consecutive years. In 2003 it evolved into the Latin American Symposium on Dependable Computing, which has since been held in multiple countries. Today, research groups are firmly established, and premier international events have been held in the region, among them DSN in Rio de Janeiro in 2015, the ACM Symposium on Principles of Distributed Computing (PODC) in Mexico in 1998, and the IEEE International Symposium on Reliable Distributed Systems, held twice in Brazil: in Florianópolis in 2004 and in Salvador in 2018.