Root Cause of Failure of Telephone Banking System
Software written internally by HSBC caused an intermittent failure (don’t you hate those) of Mastercard’s Maestro system last weekend. This caused thousands of HSBC’s customers to be unable to make purchases or withdraw cash.
The bank is now conducting a “major incident review” that should be completed by Friday. The review will look at the problems with the software and why recovery took so long (four hours after the offending software was removed).
How is a root cause analysis of a software failure different from the root cause analysis of an equipment failure or a human error that causes an explosion or plant shutdown? Really, there isn’t a difference in the tools to use. The only difference is the technology involved.
I found this out back in the 1990s when working with Gerald Starling at BellSouth. He used TapRooT® to investigate telecommunications incidents (network reliability, 911 outages, etc.). These were often software issues. And using TapRooT®, he found fixable root causes that improved performance.
The technology (network reliability) was very different from anything I had previously investigated. Even though I am an electrical engineer, the terminology of network reliability was completely foreign to me. Yet the reasons for human errors and system failures were in the Root Cause Tree® (part of the TapRooT® System).
The reason for this is that the causes of unreliable human performance (mistakes – human errors) are the same no matter what type of technology the human is involved with. Therefore, the ways to achieve reliable human performance are a basic part of the analysis that TapRooT® helps an investigator perform.