Lightning NOT the Root Cause of Amazon Data Center Outage
The Inquirer published this article:
Lightning did not cause Amazon datacentre outage
Interesting to see the root cause analysis of a computer reliability problem being discussed.
First, we could argue whether “lightning” can be a root cause at all. But let’s save that argument for some other time.
What I found interesting in this article was that the investigators eliminated a potential cause and then kept looking further.
This looks like a power supply reliability root cause analysis. The first step in that process is collecting evidence and troubleshooting the “cause” of the failure.
Since they don’t know the reason that the transformer exploded, finding a root cause is going to be difficult.
It would be interesting to see the process used in this engineering analysis, which is still at the start of the evidence collection and evaluation that feeds the root cause analysis.
Next, the article goes on to discuss problems with transferring the load to backup diesel generators. This would be a second causal factor that needs its own troubleshooting and root cause analysis.
The approach for corrective action was mentioned in the article:
– more redundancy and more isolation for its PLCs, in order to prevent failures from spreading
– a new “environmentally friendly” backup PLC
– improved load balancing
– drastically shorter recovery times
All this will be accomplished “… as soon as possible.”
Of course, these corrective actions aren’t very specific (they would not meet the SMARTER criteria in TapRooT®), but they are just a list taken from an article. Perhaps the company’s corrective actions are more detailed.
Also, it is interesting to see additional safeguards being suggested before the failure of the current safeguards is understood.
For cloud computing users, let’s hope a successful root cause analysis with effective corrective actions is completed so that future outages can be minimized.