
Last month, Delta Air Lines experienced an equipment failure that caused their reservation system to shut down. Media reports indicate that close to 2,000 flights were canceled. This came only a few weeks after Southwest Airlines experienced a similar computer failure, causing numerous flight delays and cancellations.

Reports continue to indicate that this was an equipment failure, due to a small fire in a power supply in their server room.  Here is Delta’s description:

“Monday morning (August 8) an uninterrupted power source switch experienced a small fire which resulted in a massive failure at Delta’s Technology Command Center. This caused the power control module to malfunction, sending a surge to a transformer outside of Delta, resulting in the loss of power. The power was stabilized and power was restored quickly. But when this happened, critical systems and network equipment didn’t switch over to backups. Around 300 of about 7,000 data center components were discovered to not have been configured appropriately to avail backup power. In addition to restoring Delta’s systems to normal operations, Delta teams this week have been working to ensure reliable redundancies of electrical power as well as network connectivity and applications are in place.”

Keep in mind that the “uninterrupted power source switch” is more properly known as an “uninterruptible” power supply (UPS).  A UPS normally switches you over to another power source if your primary source fails.  You may have a simple UPS on your computer systems at the office, providing battery backup until power is restored.  In Delta’s case, their UPS system attempted to switch over, but configuration issues prevented a significant number of their devices (around 300 of about 7,000, by Delta’s count) from actually shifting over.
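To make the configuration problem concrete, here is a minimal sketch of the kind of audit that could flag equipment that would not shift to backup power.  It is not Delta’s actual tooling; the Component record, its two fields, and the sample inventory are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One piece of data center equipment (hypothetical record layout)."""
    name: str
    on_backup_circuit: bool      # assumed flag: wired to a UPS-backed feed
    failover_configured: bool    # assumed flag: set up to transfer automatically


def find_unprotected(components):
    """Return components that would lose power if the primary source failed."""
    return [
        c for c in components
        if not (c.on_backup_circuit and c.failover_configured)
    ]


# Made-up inventory for illustration only.
inventory = [
    Component("reservation-db-01", True, True),
    Component("core-switch-03", True, False),
    Component("app-server-17", False, False),
]

unprotected = find_unprotected(inventory)
for c in unprotected:
    print(f"{c.name}: not configured for backup power")

print(f"{len(unprotected)} of {len(inventory)} components unprotected")

# Delta's reported figures (about 300 of about 7,000 components)
# expressed as a share of the whole data center.
delta_share = f"{300 / 7000:.1%}"
print(f"Delta's reported share: {delta_share}")
```

The point of the sketch is that a check like this can be run before a failure rather than after it.  A component counts as protected only if it is both connected to backup power and configured to transfer, which is the gap Delta’s statement describes.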

Additionally, other reports indicate that the reservation system is extremely antiquated and is linked into other airlines’ (also extremely antiquated) systems.  The airlines have all patched together and upgraded their individual systems to the point that further upgrades are almost impossible; what is really required is a complete replacement, which would be EXTREMELY difficult and expensive to carry out while the system is still being used for current reservations.

So while this is discussed by the airlines as an equipment failure, I think there are more than likely multiple causal factors, of which only one (the initiating problem) was a burned-up component.  Without knowing the details, we can see several Causal Factors:

  • A UPS caught fire
  • This small fire caused a large surge and widespread power loss
  • Other equipment was not properly configured to shift to backup power
  • There is no backup in the event of a loss of the primary reservation system
  • The reservation computer system has not been upgraded to modern standards

I always question when a failure is classed as “equipment failure.”  Unless the equipment failure is an allowed event (Tolerable Failure), it is much more likely that human actions were heavily involved in the failure, with the broken equipment being only a result.