At the end of last week, one of my clients experienced some server downtime. Their internal tool would work for a moment, then fail with an error message. This happened frequently but inconsistently.
Unfortunately, I had left the office for a family trip about five minutes before the first alert came in. I wouldn’t even be able to look at the problem for another eight hours.
Situations like these always raise two important questions: How do we prevent this in the future? And, just as crucially, how do we handle it better next time?
Stick with me over the next few days to learn how we tackle these issues in general and how this specific incident was resolved.