Some years back, I worked as a CTO. During my tenure, I had a head of IT support reporting to me. He did his job quite well and had a commendable sense of duty and responsibility, and I will always think of him as a model employee.
I recall an oddly frustrating conversation that I had with him once, however. He struggled to explain what I needed to know, and I struggled to get him to understand the information I needed.
Long story short, he wanted me to sign off on switching data centers to a more expensive vendor. Trouble was, this switch would have put us over budget, so I would have found myself explaining this to the CFO at the next executive meeting. I needed something to justify the request, and that was what I sought.
I kept asking him to make a business case for the switch, and he kept talking about best practices, SLAs, uptime, and other bits of shop. Eventually, I framed it almost as a mad lib: If we don’t make this change, the odds of a significant outage that costs us $_____ will increase by _____%. In that case, we stand to recoup this investment in _____ months.
In the end, he understood. He built the business case, I took it to the executive meeting, and we made the improvements.
As much as we might like it, people in technical leadership positions often cannot get into the weeds when talking shop. If this seems off-putting to techies, I’d suggest thinking of it this way: techies hack tools, code, and infrastructure, while managers and leaders hack the business.
Tools and Incident Management
I offer this introduction because it illustrates a common friction point in organizations. Techies at the line level do their jobs well when they both concern themselves with their operational efficiency and when they focus heavily on details. This can lead to some odd patchwork systems, optimized at the individual level, but chaotic at the organizational level. Here, the tech leaders feel the pain.
Any org with incident management concerns may find itself in this position. I’ve seen approaches run the gamut from sophisticated processes centralized around Zendesk to an odd system of shared folders in Outlook to literally nothing except random phone calls.
Oftentimes, the operational management of incidents is born of frenzied necessity and evolves only as a reactive means of temporarily minimizing the chaos.
Unfortunately, that near-term minimization can lead to worse long-term problems. And so you can find yourself in charge of a system full of disparate tools, each beloved by the individuals using them, but which, taken all together, lead to organizational misses and maddening opacity.
Does this describe your situation? If you’re not sure beyond the part about fragmented tooling, consider some symptoms.
First, and most obviously, does your system completely miss detection of incidents? If you, as a technical leader, find out about operational incidents yourself, you’re experiencing misses. This should not happen.
A Byzantine incident management process spread across various tools will lead to incidents that somehow fall into a black hole. This might happen because systems fail to capture the incidents in the first place, because the systems botch or lose them in handoffs between one another, or simply because your process has such a terrible signal-to-noise ratio that no one pays attention.
Let’s assume that your process catches most issues. That doesn’t mean that you’re necessarily out of the woods. Once identified, do you get efficient resolution?
Maybe your team routinely struggles to reproduce issues from the information available. Do many issues get kicked back to the reporters labeled “could not reproduce”? Do you routinely have angry users?
If reproduction doesn’t present a problem or isn’t necessary, do you have sufficient information to find out what happened? Or do bits and pieces get lost, leading to guesswork and longer resolution times?
And, in terms of assignment and communication, do your people know who should work on what and when? Does this require them to log into several systems and deal with ambiguity?
Insufficient Post-Mortems
Another sign of system complexity comes in the form of issue post-mortems. (You do retrospect on root cause, don’t you?) If retracing an incident through its lifecycle gives you fits, you have a problem.
But beyond that, management should have a coherent window into this in the form of a dashboard. After all, improving operations is what management is supposed to do. When I mentioned “hacking the business” earlier, I meant this exact thing. You need the ability to audit and optimize organization-level processes.
If you find yourself entirely reliant on anecdotal information from individuals or if you find yourself mired in random log files, you have an issue.
Alerting to the Rescue
Your absolute first step is to establish a reliable alerting infrastructure. Effective incident management hinges upon the right people having the right information as soon as humanly possible. This means alerts.
To alleviate the pain points above, you need to focus on two key points. Reminiscent of David Allen’s wisdom in “Getting Things Done,” here they are.
- Make sure nothing can possibly slip through the cracks and that the system captures and alerts about everything it needs to.
- Limit the rate of false positives and noisy alerts to ensure that every real issue receives full attention.
I have offered you deceptively simple wisdom here because the devil lies in the details. But if you keep your eye on these two overarching goals, you’ll eventually see improvement. Find a way to guarantee the first point, and then work through the pain of saturation, making your alerts and responses more efficient. Oh, and it never hurts to improve your products so that they produce fewer alert-worthy problems.
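To make the two goals concrete, here is a minimal sketch of an alert gate that captures every event for the audit trail while throttling duplicate pages. All names here (the class, the fingerprint scheme, the window size) are hypothetical illustrations, not a real tool’s API.

```python
import time


class AlertGate:
    """Toy alert gate illustrating the two goals: record every event
    (nothing slips through the cracks), but only page a human for the
    first occurrence of an issue within a suppression window."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.last_sent = {}   # fingerprint -> timestamp of last page
        self.log = []         # full capture of every event (goal 1)

    def handle(self, fingerprint, message, now=None):
        """Record the event; return True if someone should be paged."""
        now = time.time() if now is None else now
        self.log.append((now, fingerprint, message))  # always captured
        last = self.last_sent.get(fingerprint)
        if last is not None and now - last < self.window:
            return False  # duplicate within window: suppress (goal 2)
        self.last_sent[fingerprint] = now
        return True


gate = AlertGate(window_seconds=300)
gate.handle("db-timeout", "query exceeded 5s", now=0)    # pages: True
gate.handle("db-timeout", "query exceeded 5s", now=60)   # suppressed: False
gate.handle("db-timeout", "query exceeded 5s", now=400)  # window expired: True
```

Note that even the suppressed event lands in `log`, so the first goal (complete capture) is never sacrificed to the second (limiting noise).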
Consolidate and Standardize
Once you’ve got efficient alerting in place, you need to standardize around it. Look to minimize the number of different platforms and tools that you have to use to eliminate knowledge duplication and impedance mismatches from your workflow.
I do not intend to say that you should seek the one operational tool to rule them all. Rather, I mean that you should opportunistically eliminate tools that largely overlap with others or that realize only a tiny fraction of their value proposition.
The key underlying principle here is one that any good DBA or software engineer could address: the aforementioned knowledge duplication. Make sure that you have a single, authoritative source of truth for all incident-related information. And then, make sure that your alerting infrastructure draws from this well.
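As a sketch of what “single source of truth” means in practice, consider one canonical incident record that every tool reads from and writes to, rather than each tool keeping its own copy of the facts. The field names and store below are hypothetical, purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """One canonical incident record. Every downstream tool (alerting,
    ticketing, dashboards) references this, not its own copy."""
    id: str
    source: str                        # which system detected it
    opened_at: float                   # epoch seconds
    severity: str
    assignee: Optional[str] = None
    resolved_at: Optional[float] = None


class IncidentStore:
    """Toy authoritative store: one record per incident, upserted in place."""

    def __init__(self):
        self._incidents = {}

    def upsert(self, incident: Incident):
        self._incidents[incident.id] = incident

    def get(self, incident_id: str) -> Optional[Incident]:
        return self._incidents.get(incident_id)

    def open_incidents(self):
        return [i for i in self._incidents.values() if i.resolved_at is None]
```

Because the alerting layer and the dashboards both query the same store, there is no chance of two tools disagreeing about an incident’s state.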
Layer Dashboard on Top
Last but not least, make your own life easier and your time more effectively spent. With proper alerting in place and with the consolidation battle won, give yourself dashboards to make your decision making much simpler. No more peering at log files and weeding through inboxes to calculate response times. Make sure you have all that at a glance.
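The kind of number a dashboard should surface at a glance is, for instance, mean time to resolve. A minimal sketch, with made-up incident timestamps purely for illustration:

```python
from statistics import mean

# Hypothetical incident records: (opened_at, resolved_at) in epoch seconds;
# None means the incident is still open.
incidents = [
    (0, 1800),     # resolved in 30 minutes
    (100, 7300),   # resolved in 120 minutes
    (200, None),   # still open, excluded from the metric
]


def mean_time_to_resolve(records):
    """Mean resolution time in minutes across resolved incidents."""
    durations = [(done - start) / 60
                 for start, done in records
                 if done is not None]
    return mean(durations) if durations else None


print(mean_time_to_resolve(incidents))  # 75.0 minutes
```

The point is not this particular metric but the shift it enables: the data already lives in one place, so rolling it up for an executive conversation is a query, not an archaeology project.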
If you’re going to make business cases and hack the organization, you can’t spend your time talking shop and putting out fires, however much that might appeal. You need to switch from tactical to strategic mode and put yourself in a position to speak to the impact that various response times and incident importance thresholds have on the bottom line. Your fellow managers or members of the C-suite will thank you.