Databus Issue: 2003 4 10/02/2003

Why Do I Need Fault Tolerance?

Rick Corl, Consultant
What does downtime cost a school district or county office?

A few simple precautions can protect your data and keep your professional reputation and working relationships healthy. Education organizations should understand the difference between fault tolerance and data protection. Many organizations implement a fault tolerance measure only after a bad event has occurred, and for many, particularly in education, the concept of fault tolerance extends only as far as nightly or perhaps weekly tape backups. Tape backups are data protection: they safeguard our data and act as an insurance policy against excessive downtime. They do not prevent downtime, but they do provide a convenient method of quick recovery once the hardware issue has been resolved.

Fault tolerance is a critical consideration when mission-critical servers are involved. A well-designed fault tolerant system can withstand problems such as hard disk crashes and failures of network cards, motherboards, controllers, CPUs, or memory chips. Fault tolerance allows the computer to continue operating when components fail. A fault tolerant computer automatically substitutes another available (healthy) component for the faulty one. Once the faulty component has been repaired or replaced, it can be placed back into service, all without bringing the computer down during the repair process.

For example, mirroring or RAID (Redundant Array of Inexpensive Disks) makes the hard drives on a mission-critical server fault tolerant. If one drive fails, other drives holding duplicate or reconstructable data keep the server running. Mirroring is the easiest and most basic implementation of hard disk fault tolerance (it is also known as RAID level 1). Mirroring simply keeps an exact copy of one disk on another. If one drive fails, the other drive is available to take over. Other RAID levels work toward the same goal differently: they distribute parts of the data, along with redundancy information, across several disks (three or more for the common RAID 5 level). One factor when considering these RAID levels is speed. A mirror configuration writes the same data to both drives simultaneously, while a striped RAID configuration spreads the data among multiple disks, which generally makes it faster, particularly for reads.
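
The difference between keeping a full copy and keeping redundancy information can be illustrated with a minimal Python sketch. It is not how any particular controller works: the block contents, the drive names, and the `xor_blocks` helper are all assumptions made up for illustration, and XOR parity is only one common way to store redundancy.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical data block to be stored.
data = b"GRADES01"

# Mirroring: the same block is written to both drives.
mirror_a = data
mirror_b = bytes(data)  # exact copy held on the second drive

# Striping with parity: the block is split across two drives,
# and a third drive holds the XOR parity of the two pieces.
drive1, drive2 = data[:4], data[4:]
parity = xor_blocks([drive1, drive2])

# Simulate losing drive2, then rebuild its contents from the survivors.
rebuilt_drive2 = xor_blocks([drive1, parity])
recovered = drive1 + rebuilt_drive2
```

In this sketch the mirror needs twice the disk space of the data, while the parity layout needs only one extra drive for the whole set, which is part of why striped configurations are attractive on larger arrays.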

Mirroring or RAID is more stable and reliable when implemented in hardware rather than in software. A hard drive controller card called a RAID controller has its own BIOS (Basic Input/Output System) settings, and the RAID configuration is done through the controller's hardware setup. Often, the multiple drives are configured as one logical disk drive, so the operating system "sees" and uses a single hard drive. You may have five physical hard drives; however, the controller organizes them as one logical drive and presents that drive to the operating system.

Software mirroring or RAID is a cheaper solution because it does not require additional hardware. Most software mirroring is operating system-based, meaning that the operating system is responsible for configuring and managing the hard drives. Software mirroring is less reliable because the operating system has to boot without faults. If you cannot boot your operating system, your data may be lost. Dynamic Disk Mirroring, available in Windows 2000, is more reliable and addresses boot failures so that you can recover your data. However, hardware mirroring/RAID is far more reliable, and recovery is easier when a drive fails.

Clustering is another fault tolerance method. It is similar to mirroring/RAID, but it applies fault tolerance to the entire computer. A special clustering card is placed in two or more identical servers. If one server fails, an identical server automatically takes its place. This is the simplest form of fail-over for mission-critical servers: the service switches to a backup when the primary fails, and you can work on the down server without experiencing downtime. Clustering is a very expensive solution because it requires special hardware and multiple servers. Clustering also requires operating system support. Windows 2000 Advanced Server and UNIX are capable of supporting clustering fault tolerance solutions.
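
The fail-over idea can be sketched in a few lines of Python. This is an illustrative model only, not the behavior of any clustering product: the `FailoverPair` class, the server names, and the rule of switching after three missed heartbeats are all assumptions chosen for the example.

```python
class FailoverPair:
    """Toy model of a two-server cluster with one active and one standby node."""

    def __init__(self, primary, standby, max_missed=3):
        self.active = primary
        self.standby = standby
        self.max_missed = max_missed  # assumed threshold, not a standard value
        self.missed = 0

    def heartbeat(self, received):
        """Record one heartbeat interval and return the currently active server."""
        if received:
            self.missed = 0
        else:
            self.missed += 1
            # Too many missed heartbeats: promote the standby, if there is one.
            if self.missed >= self.max_missed and self.standby is not None:
                self.active, self.standby = self.standby, None
                self.missed = 0
        return self.active


pair = FailoverPair("server-a", "server-b")
for ok in [True, False, False, False]:
    active = pair.heartbeat(ok)
```

After the three missed heartbeats in this run, `active` is `"server-b"`. Requiring several missed heartbeats before switching is a common design choice because a single lost signal does not necessarily mean the primary has failed.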

Load balancing provides fault tolerance and has multiple uses. Load balancing distributes requests to multiple servers, allowing the servers to share the workload; if one server fails, the remaining servers continue to handle requests. A common use for load balancing is in a multiple Web server environment. A load-balancing server sits at the outside edge of the network and directs each request to the Web server with the least traffic or use. Load balancing is a very scalable solution because you can keep adding servers to absorb additional workload.
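
Sending each request to the least busy server can be sketched as follows. This is a simplified illustration rather than how any specific load balancer is implemented: the `pick_server` function, the server names, and the connection counts are hypothetical.

```python
def pick_server(active_connections, healthy):
    """Return the healthy server that currently has the fewest connections."""
    candidates = {
        name: count
        for name, count in active_connections.items()
        if name in healthy
    }
    if not candidates:
        raise RuntimeError("no healthy servers available")
    return min(candidates, key=candidates.get)


# Hypothetical snapshot of current load on three Web servers.
connections = {"web1": 12, "web2": 4, "web3": 9}

# Normal operation: all servers are healthy, so the least busy one is chosen.
target = pick_server(connections, healthy={"web1", "web2", "web3"})
connections[target] += 1

# Fault tolerance: if web2 fails, requests go to the next least busy server.
fallback = pick_server(connections, healthy={"web1", "web3"})
```

Here `target` is `"web2"` and `fallback` is `"web3"`. Adding a server to the pool only means adding one more entry to the table, which reflects why this approach scales easily.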

Five simple rules for putting it all together:
Rule 1 – Know what the system is supposed to do.
Rule 2 – Look at what can go wrong.
Rule 3 – Know your applications. Understand the requirements of your application.
Rule 4 – Determine what kind of fault tolerance you need.
Rule 5 – Determine the level of availability you need for your mission-critical applications or business-critical servers.
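
For Rule 5, it can help to translate an availability percentage into hours of downtime per year. The sketch below is simple arithmetic; the function name and the sample availability levels are chosen for illustration and are not recommendations.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def downtime_hours_per_year(availability_percent):
    """Hours per year a system may be down at the given availability level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Sample availability targets to compare.
for level in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(level)
    print(f"{level}% availability allows about {hours:.2f} hours of downtime per year")
```

At 99% availability, a server can be down for roughly 87.6 hours a year, or more than three and a half days. At 99.9% the allowance falls to under nine hours, which is the kind of difference that determines how much fault tolerance a given application justifies.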

It’s a sure bet that a problem will occur at some point, usually when you least expect it. Resolving that problem under the pressure of “get the network running again” is likely to earn you the unwelcome scrutiny of the superintendent’s office. If network resources aren’t continuously available and reliable, then instructional staff will not aggressively integrate them into the curriculum and you will lose respect and credibility. Developing network resources that are fault tolerant helps to ensure that network resources will remain available even when we experience the occasional system problem. Fault tolerance will keep you out of the superintendent’s office and in the catbird seat!

