Crisis Management – A Primer
Outages happen. SaaS vendors need to have a robust process to minimize them, manage them to happy conclusions, and learn from them. The 7 steps described here, numbered 0 through 6, explain how to proceed.
Step 0: Define roles and responsibilities ahead of time
During an outage, emotions and confusion reign. Prepare ahead of time by defining roles for everyone likely to be involved, and in particular:
- The support team, who is usually responsible for customer communication
- The operations team, whose focus is to return the system to a stable state
- For larger vendors, the marketing communication team, who can help craft the communication both during and after the outage
- Executives and the customer success team, who can help bring resources to bear (execs) and spread the word to customers (CSMs) but otherwise play more of a bystander role
Define and document the crisis management process and the communication template to be used in steps 3-5. Practice a few dry runs, including off hours when reaching appropriate individuals may be challenging.
Step 1: Monitor systems to detect outages early
Outages are bad, but they can be a little less bad if you detect them before customers do. Have tools and processes in place to find out about problems right away — and alert internal parties, chief among them the support team.
Plan how the Operations team will be alerted if the outage is discovered by an internal party, or by a customer. This is usually a direct, synchronous contact such as a phone call.
Step 2: Qualify the outage
Not every outage deserves a formal crisis management process. If the outage is very brief, it’s often best to skip the entire customer alert process (in other words, proceed to step 6, post-mortem). The Operations team makes the call on whether to declare a formal outage, based on what is known about the issue and the solution.
Step 3: Alert affected customers
It may be tempting to simply keep quiet when there is a system issue, but customers will likely be incensed if they find that you knew about the issue but did not warn them. The difficulty is to avoid unduly alarming customers who won’t be touched by the issue. Do all you can to identify the customers who are using the server, service, or tool that is experiencing the problem, and alert them and not others.
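As a minimal sketch of that targeting step, assuming you keep a record of which services each customer uses, the filter could look like this. The `Customer` class, its fields, and the service names are hypothetical and stand in for whatever your own account data looks like.

```python
from dataclasses import dataclass, field


@dataclass
class Customer:
    # Hypothetical record; a real system would read this from the account database.
    name: str
    email: str
    services: set = field(default_factory=set)


def affected_customers(customers, failing_services):
    """Return only the customers who use at least one failing service."""
    failing = set(failing_services)
    return [c for c in customers if c.services & failing]


customers = [
    Customer("Acme", "ops@acme.example", {"api", "reports"}),
    Customer("Globex", "it@globex.example", {"reports"}),
]
# Only customers who use the failing "api" service are selected for the alert.
to_alert = affected_customers(customers, ["api"])
```

The sketch errs on the side of precision: a customer with no recorded link to the failing service is not alerted, so the quality of the result depends on how current the usage records are.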
Alerts are usually proactive, by email, but they could be by text or phone — and they can also be posted on the website to be seen by customers who seek information.
Best practice is to use a template so information can go out quickly and yet be reasonably worded so as not to create panic. Include:
- The nature of the issue
- Its likely resolution time, if known
- A time commitment for the next update
- If appropriate, a method to obtain progress information such as a document in the knowledge base
Having the marketing communication team help craft this and other customer messages is wonderful, but focus on speed rather than craftsmanship at this point (and think about how you will do this at 2am on a Saturday).
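One way to make a template usable at 2am is to reduce it to a fill-in-the-blanks function. The sketch below covers the four items listed above. The wording, the function name `render_initial_alert`, and its parameters are illustrative assumptions, not a recommended text.

```python
from datetime import datetime
from string import Template

# Hypothetical wording; adapt it to your own environment and customers.
INITIAL_ALERT = Template(
    "We are investigating an issue affecting $service.\n"
    "What is happening: $nature\n"
    "Expected resolution: $eta\n"
    "Next update by: $next_update\n"
    "$progress_line"
)


def render_initial_alert(service, nature, next_update, eta=None, progress_url=None):
    """Fill in the alert template; eta and progress_url are optional."""
    progress_line = f"Progress information: {progress_url}" if progress_url else ""
    message = INITIAL_ALERT.substitute(
        service=service,
        nature=nature,
        eta=eta or "not yet known",
        next_update=next_update.strftime("%H:%M UTC"),
        progress_line=progress_line,
    )
    # Drop the trailing blank line left behind when there is no progress link.
    return message.rstrip()


alert = render_initial_alert(
    service="the reporting service",
    nature="Reports are failing to load for some accounts.",
    next_update=datetime(2024, 1, 6, 3, 0),
)
```

Because the resolution time and the progress link are optional, the person on call only has to supply the nature of the issue and the time of the next update to get a message out.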
Step 4: Send updates at regular intervals
If the outage lasts more than a few minutes, you will need to send updates. Best practice is to:
- Commit to a specific time for the next update in each communication. You can always give an update sooner than promised if a new development occurs.
- Give updates at least hourly unless there is a well-understood resolution path that has a known, longer duration. (Very frequent, systematic updates, such as every 15 minutes, are usually counterproductive: they can distract the team from a quick resolution, and not much is accomplished in such short intervals.)
- Provide high-level progress summaries. Customers want a full resolution, of course, but will be encouraged if a diagnosis has been made, even if the resolution path is long.
The support team usually sends the updates, relying on predefined templates appropriate for your particular environment and customers. When you create the templates, remember that you may end up sending multiple updates so make sure that they create a spirit of reasonable hope and not just a pile of empty sentiments.
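The timing rule above can be expressed as a small helper that picks the time to promise in each message. This is a sketch of one reading of the guideline: the one-hour default comes from the text, while the function name and the `known_resolution_at` parameter are assumptions introduced for illustration.

```python
from datetime import datetime, timedelta

# Default cadence from the guideline above: updates at least hourly.
DEFAULT_INTERVAL = timedelta(hours=1)


def next_update_time(now, known_resolution_at=None):
    """Return the time to commit to for the next customer update.

    If the resolution path is well understood and is expected to finish
    later than one hour from now, commit to that time instead of sending
    hourly updates with nothing new to report.
    """
    hourly = now + DEFAULT_INTERVAL
    if known_resolution_at is not None and known_resolution_at > hourly:
        return known_resolution_at
    return hourly
```

For example, if a restore started at 02:00 is expected to complete at 05:00, the helper returns 05:00. With no known resolution time it returns 03:00. An earlier update can still be sent if something new develops.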
Step 5: Alert affected customers of the end of the outage
Once Operations gives the green light, the support team sends the final update. You will want to have a template for end-of-outage notices as well.
Step 6: Conduct a post-mortem
Up until this point, all efforts are focused on resolving the outage. But step 6 is the most important step because it uncovers the root cause of the outage and defines long-term fixes.
The Operations team leads the post-mortem and produces an internal report. In turn, the Support team can disseminate an appropriately edited version to affected customers. This should occur within a few business days of the outage and does not need to occur immediately, especially if the outage occurred off hours.
Most vendors review outage post-mortems at the executive level to ensure that appropriate efforts are being deployed to minimize outages, especially repeated outages with the same underlying cause.
Do you have a crisis management process? What works well for you? Share your tips in the comments.