Back to Basics: Root Cause Analysis

Many thanks to Robin Kirby and several others for asking about root cause analysis.

What is it?

Root cause analysis consists of reviewing past support cases and other customer interactions (including community activity) to determine the reasons for customers’ issues with the product or service and to address their root cause, with the goal of improving the overall customer experience. This can mean improving product quality, adding features, upgrading the documentation or knowledge base, providing training or onboarding assistance, etc.

Why do it?

A better customer experience makes for more loyal customers, more renewals, and more referrals (in other words, it’s the root of anti-churn). It also decreases support costs, but in my mind this should be the least important reason to perform root cause analysis. Like many support processes, root cause analysis is quite straightforward, but it is often neglected because of time constraints, despite the huge potential payback.

How to do it?

There are three steps to root cause analysis, the first one being absolutely crucial to the success of the other two: collect appropriate data, analyze it, and act on it.

About data collection:

Define tags or categories that will be helpful for post-hoc analysis. The taxonomy should be designed specifically to illuminate the distribution of the issues being reported, so be entirely pragmatic. In other words, don’t try for a sedate, encyclopedic, Dewey Decimal System-like taxonomy. If you are using a hierarchical category tree, which most tools support, don’t hesitate to place at the very top of the tree a type of issue that accounts for a very large volume of requests.
Separate the root cause from the categories. If you cram the root cause into the category tree, you will likely create lots of duplication and a maintenance nightmare. Instead, recognize that root causes are orthogonal to categories and should be tagged separately. The list of root causes is short and quite universal: software bug, user error, question, documentation error, hardware fault, third-party problem.
Tag each request (again, we are not just looking at support cases here, but also at communities and other interactions, online or not) with categories and root cause. This is where a minimalist approach to categories pays off: fewer categories won’t tax the support engineers too much, so adoption will be easier. Better to collect two accurate pieces of data than four sloppy ones.
Use links between cases and knowledge base documents. Categories and root causes are handy, but for a chance at a finer level of root cause analysis, exploit the links between cases and knowledge base documents. Think broadly about the universe of documents. For instance, it would be handy to link to a specific course description when the customer needs training: handy for the customer, who has a complete record of the recommendation, handy for cross-selling, and handy for root cause analysis that can capture the exact training need.
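
As a rough illustration of keeping root causes separate from categories, here is a minimal Python sketch of a tagged request. The class and field names, the sample category path, and the "KB-204" identifier are all hypothetical; only the six root causes come from the list above. Your own tool will have its own data model.

```python
from dataclasses import dataclass, field
from enum import Enum


class RootCause(Enum):
    # The short, near-universal list of root causes given above.
    SOFTWARE_BUG = "software bug"
    USER_ERROR = "user error"
    QUESTION = "question"
    DOCUMENTATION_ERROR = "documentation error"
    HARDWARE_FAULT = "hardware fault"
    THIRD_PARTY_PROBLEM = "third-party problem"


@dataclass
class Request:
    request_id: str                  # hypothetical identifier format
    source: str                      # e.g. "case" or "community" (assumed values)
    category: tuple                  # path in the category tree, top level first
    root_cause: RootCause            # tagged separately from the category
    kb_links: list = field(default_factory=list)  # linked documents, courses, etc.


request = Request(
    request_id="R-1001",
    source="case",
    category=("Installation", "License key"),
    root_cause=RootCause.DOCUMENTATION_ERROR,
    kb_links=["KB-204"],
)
print(request.category[0], "/", request.root_cause.value)
```

Because the root cause is its own field, the category tree does not need a "bug" and a "documentation" branch under every topic.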

Run regular metrics on categories and root causes. Note trends, both emerging and stable clusters, and run more detailed investigations as appropriate. It takes a trained support engineer a couple of hours at most to sort through a cluster of issues and identify actionable steps.
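
One possible way to run such metrics, sketched in Python under the assumption that tagged requests can be exported as (period, category, root cause) rows. The sample rows and the function names are invented for illustration.

```python
from collections import Counter

# Hypothetical sample: one (period, category, root cause) row per tagged request.
tagged = [
    ("2024-01", "Installation", "documentation error"),
    ("2024-01", "Reporting", "software bug"),
    ("2024-02", "Installation", "documentation error"),
    ("2024-02", "Installation", "documentation error"),
    ("2024-02", "Reporting", "software bug"),
]


def cluster_counts(rows, period):
    """Count requests per (category, root cause) pair for one period."""
    return Counter((cat, cause) for p, cat, cause in rows if p == period)


def trend(rows, previous, current):
    """Return the change in volume per cluster between two periods."""
    before = cluster_counts(rows, previous)
    after = cluster_counts(rows, current)
    return {key: after[key] - before[key] for key in before.keys() | after.keys()}


changes = trend(tagged, "2024-01", "2024-02")
# Largest increases first: these are the emerging clusters worth a closer look.
for cluster, delta in sorted(changes.items(), key=lambda item: -item[1]):
    print(cluster, delta)
```

A cluster with a steady count is a stable one; a cluster whose count jumps between periods is a candidate for the more detailed investigation.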

Act on the findings! I like to maintain a “top 10” list of bugs and enhancement requests that impact the largest number of customers (or cause the most pain, or impact large customers the most). The top 10 list can be the basis of a satisfying dialog with Engineering and Product Marketing.
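
A list like this could be derived along the following lines. This is only a sketch: it ranks by number of distinct customers affected, which is just one of the criteria mentioned above, and the issue identifiers and customer names are made up.

```python
from collections import defaultdict

# Hypothetical sample: (issue id, customer) pairs taken from tagged requests.
reports = [
    ("BUG-12", "Acme"), ("BUG-12", "Globex"), ("BUG-12", "Acme"),
    ("ENH-7", "Initech"),
    ("BUG-3", "Acme"), ("BUG-3", "Initech"), ("BUG-3", "Globex"),
]


def top_issues(pairs, limit=10):
    """Rank issues by number of distinct customers affected, largest first."""
    customers = defaultdict(set)
    for issue, customer in pairs:
        customers[issue].add(customer)
    # Ties are broken by issue id so the ordering is stable.
    ranked = sorted(customers.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(issue, len(who)) for issue, who in ranked[:limit]]


top10 = top_issues(reports)
for issue, count in top10:
    print(issue, count)
```

Weighting by customer size or by pain would mean changing the sort key, not the overall approach.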

What is your root cause analysis process? What are your recent successes? Do share in the comments.