Failure of the Internet or Intranet
Internet and Intranet uptime is typically the
responsibility of the Network Administrator; the
DBA usually has no control over these networks.
Failure of the Internet connection,
usually due to the provider, means no one
outside the company Intranet can access the
application. Failure of the Intranet or internal
networks means no one inside the company can
access the application. Because these networks
are usually built from multiple components, each
of those components should have built-in
redundancy.
Implementing the redundancy is not enough on
its own: it should also be tested regularly to
prove that the design functions as intended.
Failure of the Firewall
The firewall regulates the flow of traffic
between networks of dissimilar trust levels.
The Internet is a no-trust zone, the
demilitarized zone (DMZ) is an intermediate
trust zone, and the internal network is a
trusted zone.
A proper firewall configuration implements a
default-deny rule-set and allows only the
network connections that have been explicitly
permitted.
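The default-deny idea can be illustrated with a
minimal Python sketch. The zone names, the Rule
type, and the is_allowed helper are hypothetical
and exist only for this illustration; the ports
are examples (443 for HTTPS, 1521 as the usual
Oracle listener default), not a recommended
rule-set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One explicitly permitted connection (hypothetical model)."""
    src_zone: str
    dst_zone: str
    port: int


# Illustrative allow-list: anything not listed here is denied.
ALLOW_RULES = {
    Rule("internet", "dmz", 443),    # users reach the web tier over HTTPS
    Rule("dmz", "internal", 1521),   # app tier reaches the database listener
}


def is_allowed(src_zone, dst_zone, port, rules=ALLOW_RULES):
    """Default-deny: permit only connections that match an explicit rule."""
    return Rule(src_zone, dst_zone, port) in rules


if __name__ == "__main__":
    print(is_allowed("internet", "dmz", 443))        # explicitly permitted
    print(is_allowed("internet", "internal", 1521))  # no rule, so denied
```

The point of the design is that the check has no
"deny" list at all: a connection with no matching
rule is refused without further configuration.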
No firewall is needed if the database is
strictly on an internal network with no
connection to the Internet.
However, if users access the database
through or from a non-trusted zone, such as the
Internet, and there is only one firewall, a
failure will prevent anyone outside the firewall
from contacting the database. Internal users,
those inside the firewall on the same network,
may still have access.
Failure of the Application Server
The application server usually serves the web
pages, reports, forms, or other interfaces to
the users of the system. If there is only a
single application server and it goes down, even
if the database is fully functional, there is no
application to run against it. A failed
application server without redundancy means no
one can use the database, even if all other
components are still functional.
Failure of the Database Server
The failure of the database server is the one
failure that is taken care of in a normal RAC
configuration. Failure of a single database
server leads to failover of the connections to
the surviving node. While not a critical failure
that will result in loss of the system, a single
server failure means a reduction in performance
and capacity. Of course, a catastrophic failure
of both servers will result in total loss of
service.
If the application is mission critical, consider
sizing the servers so that a surviving node can
handle the load of the failed instance without a
noticeable reduction of service.
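As a rough way to check this sizing rule, the
following Python sketch estimates how busy the
surviving nodes would be after a failure. It
assumes, for simplicity, that the load is spread
evenly and that all nodes have the same capacity;
the function name and the figures are
hypothetical.

```python
def surviving_utilization(peak_load_per_node, node_capacity, nodes, failed=1):
    """Estimate utilization of each surviving node after `failed` nodes go down.

    Assumes the total peak load is redistributed evenly across survivors
    and that every node has the same capacity (same units as the load).
    """
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes: total loss of service")
    total_load = peak_load_per_node * nodes
    return total_load / (survivors * node_capacity)


if __name__ == "__main__":
    # Two-node cluster, each node at 45% of its capacity at peak.
    u = surviving_utilization(peak_load_per_node=0.45, node_capacity=1.0, nodes=2)
    print(f"Survivor utilization after one failure: {u:.0%}")
```

A result near or above 100 percent suggests that a
single failure would produce a noticeable
reduction of service, so each node would need
more headroom.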
The servers will have disk controllers or
interfaces that connect through the switches to
the SAN arrays. These controllers or interfaces
should be redundant and have multiple channels
per controller or interface. The network
interface cards (NICs) should also be redundant,
with at least a single spare that can take the
place of either the network connection card or
the cluster interconnect card should a failure
occur.
Failure of the Fabric Switch
The fabric switch allows multiple hosts to
access the SAN array. These switches communicate
via FCP (Fibre Channel Protocol).
A fabric switch differs from a typical
Ethernet switch because the protocol is
different and FCP supports redundant paths
between multiple components, creating a mesh
network.
This design is important for I/O failover
and I/O scalability.
Failure of one redundant fabric switch
can result in loss of performance.
Complete fabric switch failure will
result in a full RAC crash. If the RAC shared
disk is unavailable, the Oracle RAC instances
are worthless.
SAN Failure
Failure of a single drive can result in severe
performance degradation. During a disk failure
in a RAID-5 array, the replacement or hot spare
disk has to be rebuilt using the parity
information found on the surviving drives in the
RAID-5 set.
During this rebuild process, RAID-5 I/O can
slow down by a factor of as much as 4 to 10
(400-1000 percent).
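The rebuild described above relies on XOR parity.
The following Python sketch is a toy illustration
only: the block contents and the xor_blocks
helper are made up, and a real array works on
whole stripes in the controller, not on short
byte strings.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks in one stripe (n = 3) plus one parity block.
data = [b"ORCL", b"RAC1", b"RAC2"]
parity = xor_blocks(data)

# Simulate losing the second drive: only the survivors and parity remain.
survivors = [data[0], data[2], parity]

# The lost block is the XOR of everything that survived.
rebuilt = xor_blocks(survivors)

if __name__ == "__main__":
    print(rebuilt == data[1])
```

Every rebuilt block requires reading the
corresponding block from all surviving drives,
which is why normal I/O competes so heavily with
the rebuild.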
Failure of a RAID-0+1 drive has little effect on
performance because its mirror drive takes over
while the hot spare is rebuilt on an
"as available" basis. In a RAID-5 array, the
drives are usually set up in an n+1
configuration, meaning n drives' worth of data
in a stripe set plus one drive's worth of parity
(RAID-5 distributes the parity across all of the
member drives).
When a drive fails, there must be an immediate
spare available to replace it; even if no hot
spare is available, a cold spare should be on
hand.
If the hot spare has already been activated
and a second drive is lost, the entire array is
in jeopardy. Most of these arrays use
hot-pluggable drives, meaning a failed drive can
be replaced while the system is running.