High availability, fault tolerance and disaster recovery

PUBLISHED ON: Wednesday, Jul 5, 2023

High availability (HA)

  • HA aims to ensure an agreed level of operational performance (usually uptime) for a higher than normal period.
  • HA doesn't aim to prevent failures; customers may still face brief outages.
  • HA is not about user experience; a switch to standby infrastructure can still be visible to users.
  • HA aims to maximize a system's online time.
  • HA requires redundant servers/infrastructure to be in place, ready for customers to be switched to in the event of a failure, to minimise downtime.
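
The switch to redundant infrastructure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Server` class, the `pick_active` function and the server names are hypothetical, and a real setup would rely on health checks and a load balancer or DNS failover rather than a flag.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical server record; `healthy` stands in for a real health check."""
    name: str
    healthy: bool = True


def pick_active(primary: Server, standby: Server) -> Server:
    """Return the server that should receive customer traffic."""
    if primary.healthy:
        return primary
    if standby.healthy:
        # Failover: customers are switched to the redundant server.
        return standby
    raise RuntimeError("no healthy server available")


primary = Server("primary")
standby = Server("standby")

print(pick_active(primary, standby).name)  # traffic goes to the primary

primary.healthy = False  # simulate a failure of the primary
print(pick_active(primary, standby).name)  # traffic is switched to the standby
```

Between the failure and the switch, requests to the primary would still fail, which is why HA minimises downtime rather than eliminating it.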

Key availability levels

  • 99.9% = 8.77 hours per year downtime
  • 99.999% = 5.26 minutes per year downtime
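
The downtime figures above follow from simple arithmetic: the unavailable fraction multiplied by the length of a year. A minimal sketch, assuming a year of 365.25 days (8,766 hours), which is the assumption that reproduces the numbers in the list; the function name is illustrative:

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, averaging in leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a system may be down at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


for pct in (99.9, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.2f} minutes) per year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why the higher levels are so much harder to reach.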

Fault tolerance (FT)

  • FT is a property that enables a system to continue operating properly in the event of a failure of some of its components.
  • FT aims to operate through failures.
  • Setting up an FT mechanism is expensive and takes longer to implement than HA.
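
To illustrate "operating through failures", here is a minimal sketch in which a request is served by redundant replicas, so the caller never sees the failure of a single component. The function and replica names are hypothetical, and real fault-tolerant systems usually build this redundancy into the hardware or platform rather than into application code.

```python
from typing import Callable, Sequence


def call_with_redundancy(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each redundant replica in turn and return the first successful result."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:
            # A component failed; keep operating by using the next replica.
            errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def failing_replica() -> str:
    raise ConnectionError("replica 1 is down")


def healthy_replica() -> str:
    return "response from replica 2"


# The caller gets a normal response even though one component has failed.
print(call_with_redundancy([failing_replica, healthy_replica]))
```

The cost noted above shows up here as well: every replica has to be provisioned and kept running all the time, not only after a failure.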

Disaster Recovery (DR)

  • DR is a set of policies or tools that enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.
  • DR requires
    • pre-planning
    • backup premises
    • taking regular backups at standby locations (offsite)
    • copies of all processes
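
As a rough illustration of taking regular backups at a standby location, the sketch below copies a file into a separate directory under a timestamped name. The `back_up` function, the file name and the directories are all hypothetical; in practice the "offsite" target would be a different site or region, and the job would run on a schedule.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def back_up(source: Path, offsite_dir: Path) -> Path:
    """Copy `source` into `offsite_dir` under a timestamped name and return the copy."""
    offsite_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = offsite_dir / f"{source.stem}-{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copies the contents and file metadata
    return target


# Demonstration in a temporary directory; "offsite" is only a stand-in here.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "orders.db"
    source.write_text("example data")
    copy = back_up(source, Path(tmp) / "offsite")
    print(copy.name, copy.read_text() == "example data")
```

A backup only helps with recovery if restoring from it is also planned and tested, which is where the pre-planning and process copies come in.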