The Need for Highly Available Data
Both unplanned and planned downtime are costly
in terms of lost revenue and time. For a
business that spans time zones, planned
downtime scheduled in one time zone directly
affects business hours in another.
It is important to understand certain
key terms and concepts before examining HA
systems and databases.
Failure
Failure
is defined as a departure from expected behavior on
an individual computer system.
Software, hardware, operator and
procedural errors, along with environmental
factors, can each cause a system failure.
Availability
Availability
is a measure of the amount of time a system or
component performs its specified function.
Availability is related to, but differs from,
reliability. Reliability
measures how frequently the system fails;
availability measures the percentage of time the
system is in its operational state.
To calculate availability, both the
Mean Time Between Failures (MTBF) and
the Mean Time To Recovery
(MTTR) need to be known. The MTBF is a measure of
how long, on average, the system runs between
failures. The MTTR is a measure of
how long, on average, it takes to restore the
system to its operational state after a failure.
If both the MTBF and the MTTR are known,
availability can be calculated using the
following formula:
Availability = MTBF / (MTBF + MTTR)
For example, if the data center fails
roughly every six months (MTBF = six months) and
it takes 20 minutes, on average, to return the
data center to its operational state (MTTR = 20
minutes), then the data center availability is:
Availability = 6 months / (6 months + 20 minutes) = 99.992 percent
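The worked example above can be checked with a short script. This is a minimal sketch, not part of the original text: the function name `availability` is hypothetical, and "six months" is assumed to mean half of a 365-day year, converted to minutes so that both terms share one unit.

```python
MINUTES_PER_DAY = 24 * 60


def availability(mtbf_minutes: float, mttr_minutes: float) -> float:
    """Return availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_minutes <= 0 or mttr_minutes < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_minutes / (mtbf_minutes + mttr_minutes)


# Assumption: "six months" is taken as half of a 365-day year.
SIX_MONTHS_MINUTES = 365 / 2 * MINUTES_PER_DAY

# MTBF = six months, MTTR = 20 minutes, as in the example above.
baseline = availability(SIX_MONTHS_MINUTES, 20)
print(f"Availability = {baseline * 100:.3f} percent")

# The same function can compare the two levers: a longer MTBF or a shorter MTTR.
longer_mtbf = availability(2 * SIX_MONTHS_MINUTES, 20)
shorter_mttr = availability(SIX_MONTHS_MINUTES, 10)
print(f"Doubled MTBF: {longer_mtbf * 100:.4f} percent")
print(f"Halved MTTR:  {shorter_mttr * 100:.4f} percent")
```

Converting both quantities to the same unit before dividing is the only step that needs care; mixing months and minutes directly would give a meaningless ratio.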
Therefore, there are two ways to
improve the availability of the system: increase
MTBF or reduce MTTR. Because system
failures are unavoidable, system and
database administrators need to focus on
designing a reliable system with redundant
components, as well as on setting up a reliable
recovery methodology for when system failures
happen.
Reliability
Reliability
is the starting point for building increasingly
available systems. A measure of system
reliability is how long the system has been up or
how long it typically stays up between failures.
The nature of the failure is not important: any
failure affects the system's overall
availability. As presented in the previous
section, MTBF is often considered an important
metric for measuring system
reliability.
There are two primary means of
achieving greater reliability:
- Building high-MTBF components into the
  system
- Adding components in redundant (N+1)
  configurations
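The effect of a redundant (N+1) configuration can be illustrated with a small calculation. This is a sketch under assumptions the text does not state: every component has the same availability, components fail independently of one another, and the system is up whenever at least N of the installed components are up. The function name `k_of_n_availability` and the 0.99 figure are hypothetical, chosen only for illustration.

```python
from math import comb


def k_of_n_availability(component_availability: float,
                        n_required: int,
                        n_installed: int) -> float:
    """Probability that at least n_required of n_installed components are up.

    Assumes identical components that fail independently (binomial model).
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if not 1 <= n_required <= n_installed:
        raise ValueError("need 1 <= n_required <= n_installed")
    a = component_availability
    # Sum the probability of every outcome with enough components up.
    return sum(
        comb(n_installed, up) * a**up * (1 - a) ** (n_installed - up)
        for up in range(n_required, n_installed + 1)
    )


# Hypothetical component that is up 99 percent of the time.
single = k_of_n_availability(0.99, n_required=1, n_installed=1)
redundant = k_of_n_availability(0.99, n_required=1, n_installed=2)  # N+1, N = 1
print(f"Single component: {single:.4f}")
print(f"N+1 (1 of 2):     {redundant:.4f}")
```

In practice failures are often correlated (shared power, shared software faults), so real gains from redundancy tend to be smaller than this independent-failure model suggests.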
Serviceability
Serviceability
defines the time it takes to isolate and repair a
fault or, more succinctly, the time it takes to
restore a system to service following a failure.
Mean Time To Recovery (MTTR) is considered an important metric when
discussing the serviceability of a system or
of a component of the system. MTTR, however, is
a measure of time only and does not account for
the cost of service.
Fault-Tolerant Systems
Another important distinction
is between a high-availability
(HA) system and a fault-tolerant
(FT) system. Fault-tolerant systems offer a higher
level of resilience and recovery. They use a
high degree of hardware redundancy and
specialized software to provide
near-instantaneous recovery from any single
hardware or software unit failure.
Database Availability
When considering the availability of
databases, the total environment and
infrastructure in which a typical database is
located need to be examined.
The database application has its own
availability features, which are distinct from
those of the underlying system.
There are three situations that need to
be considered:
Another important issue for
the database is the need to maintain
database consistency. Unlike application servers
or other application instances, multiple
database instances or copies of a database cannot
simply exist side by side. Because the database
contents change in real time, multiple copies
cannot easily be kept consistent in a timely manner.