Navigating Infrastructure Outages: Battle Scars and Lessons Learned

One of my great passions is infrastructure operations. My first few jobs were in network engineering, and later on, as the cloud became more prevalent, that turned into what we now call infrastructure or cloud operations. Probably the most stressful times were during outages. Over the last two decades, I've been part of many outages, some worse than others, but I wear my operational battle scars proudly. In this blog, I'll share some learnings and observations from these incidents.

What triggered me to write this article

I could write many articles about the "epic" incidents I've been part of. While stressful at the time, they make for good war stories afterward. In recent years, companies like Facebook and Cloudflare have started publicly sharing their outage retrospectives (timelines, root causes, and lessons learned), which I enjoy reading. It's free learning!

A few days ago, Rogers posted its public incident summary, covering the typical aspects of an outage retrospective. The outage itself was significant. Rogers is Canada's second-largest ISP and cell phone carrier. Two years ago, they experienced a country-wide 24-hour outage that impacted all their phone (including 911) and internet customers. Yes, hard down for a full day! So naturally, as both a Rogers customer and internet operations enthusiast, when they posted their report (2 years later!), I had to read it.

The Rogers outage

Though not the primary focus of this article, let's briefly review the outage. During planned maintenance, Rogers staff removed a network routing policy, which caused routes (BGP or OSPF/IS-IS) to be flooded internally. This overwhelmed the control plane of the core routers, which eventually crashed or were otherwise rendered inoperable. The logical next step would be to undo that change. However, the routers themselves were no longer manageable because of the control plane issues, and to make matters worse, the out-of-band (OOB) management network was also down. As it turns out, the out-of-band network somehow depended on the production data network, so when production went down, so did the out-of-band network, making remote troubleshooting and remediation impossible. Eventually, engineers had to be dispatched to manually get console access.

Thoughts on the Rogers outage

Outages suck. They’re super stressful and, as the Rogers incident shows, can have significant real-world impacts. Having these retrospectives after the fact, and getting to the root cause, lessons learned, etc., is mandatory for any organization that wants to continuously improve itself.

One thing that stood out for me was the news article from CBC.ca (a major news outlet in Canada), the headline of which was: “Human error triggered massive 2022 Rogers service outage, report finds.” While it's arguably true that the change kicked off the sequence of events, declaring it the root cause is short-sighted. Many of these outages are like airplane crashes: the investigation usually finds multiple contributing factors, some of which date back years or trace to a poor design in the first place.

Reflecting on Outages: Can We Eliminate Them? Should We Want To?

It’s always easy to dunk on an outage from the sidelines, and to be clear, that’s not my intent. I’ve learned firsthand that shit happens! So, with the Rogers outage in mind, but also drawing on past experience, let’s extract some generic lessons.

First off, I’d say that the goal of eliminating outages altogether is likely too ambitious for most of us. Yes, part of an outage retrospective process should be to prevent the exact same outage from happening in the future, but architectures evolve, technologies change, and even a slight change in parameters can lead to a similar outage again.

So, although you can limit the chance of outages, you won’t be able to eliminate them 100%. In fact, given the cost of complete outage elimination and depending on your industry (say Facebook vs. flying a passenger airplane), you may not even want to aim for that. Trying to eliminate outages completely carries real financial costs, process costs, agility costs, and time-to-value costs.

So, let’s accept that outages will happen and instead use that insight to focus on limiting their damage and impact! This approach allows us to prioritize effective strategies, which I'll discuss next.

Time to Detection - aka Be The First To Know!

Sufficient monitoring helps to reduce the time it takes to detect an outage and pinpoint its root cause. Sounds obvious, right? Yet, almost every outage retrospective will have an improvement action item called “more monitoring.” That’s an indication that we’re always forgetting some metric; for some reason, we never seem to have all the checks and metrics we need. That’s why I’m a big fan of synthetic monitoring for availability.

This check should mimic a typical user experience. So even if you swap out a core component, you’ll still be able to see the impact without adding more metrics. Because it’s hard for such a check to miss a user-facing failure, it’s the first check you should add and, most importantly, the first one you should alert on. So whether it’s a ping or a simple HTTP test, this will give you instant confidence your service is working. I call this the “be the first to know” principle. You don’t want to hear from your customers that your service is broken; you’re responsible for it, and you should be the first to know!
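
As a rough illustration, a "simple HTTP test" of this kind could look like the sketch below. The URL, the 2-second latency budget, and the names `synthetic_check` and `CheckResult` are all assumptions made for the example, not something from a particular monitoring product. The fetch function is injectable so the check logic can be exercised without a network.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool
    status: int        # HTTP status code, or 0 if the request failed outright
    latency_s: float
    detail: str


def http_get_status(url: str, timeout_s: float = 5.0) -> int:
    """Fetch the URL the way a user would and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status


def synthetic_check(
    url: str,
    fetch: Callable[[str], int] = http_get_status,
    max_latency_s: float = 2.0,  # assumed latency budget for this example
) -> CheckResult:
    """Run one user-like request and judge it on status and latency."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except Exception as exc:  # DNS failure, timeout, refused connection, HTTP error
        return CheckResult(False, 0, time.monotonic() - start, f"request failed: {exc}")
    latency = time.monotonic() - start
    if status != 200:
        return CheckResult(False, status, latency, f"unexpected status {status}")
    if latency > max_latency_s:
        return CheckResult(False, status, latency, "too slow")
    return CheckResult(True, status, latency, "ok")


if __name__ == "__main__":
    # Hypothetical endpoint; point this at a page your users actually load.
    result = synthetic_check("https://example.com/")
    print(result)
```

Run something like this every minute from outside your own network and page on a few consecutive failures, and you have a basic "be the first to know" alert.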

Limit blast radius

Can we contain the impact when something breaks? The Rogers outage is a prime example of the worst case: one change broke the entire network. Logical separation, whether geographical or functional, could have prevented this. Modern-day cloud providers like AWS actively encourage us to think about this. With AWS, you’re always thinking about Multi-AZ or Multi-region. Some go further by going multi-cloud (think multi-vendor in network land).

However, transitioning from a single region to multi-region or multi-cloud can quickly become disproportionately complex and costly. The same is true in network land, where the path from a single vendor to multi-vendor gets complicated fast.

Certain services are easier to make multi-region than others (I’m not even thinking about multi-cloud yet). Take your core database, for example: making it multi-region comes with real complexity costs, and it's understandable why this doesn't always happen.

I get it, infrastructure teams are often considered cost centers. While high availability and reliability are touted as priorities, urgent needs, feature development, and cost pressures often take precedence. However, never let a good disaster go to waste! After a significant outage, budgets and priorities often shift. This is your opportunity to advocate for investments in resilience. The value proposition is strongest in the immediate aftermath, so seize the moment.
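
Even before a full multi-region build-out, you can limit the blast radius of a change itself by rolling it out one failure domain at a time and stopping at the first sign of trouble. This is my own illustration of the principle, not something taken from the Rogers report. The function name `staged_rollout` and the three callbacks are hypothetical; you would supply your own logic for applying, checking, and undoing a change.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def staged_rollout(
    domains: Iterable[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    roll_back: Callable[[str], None],
) -> Tuple[List[str], Optional[str]]:
    """Apply a change one failure domain at a time; halt at the first unhealthy one.

    Returns (domains that kept the change, domain where the rollout halted or None).
    """
    done: List[str] = []
    for domain in domains:
        apply_change(domain)
        if not is_healthy(domain):
            roll_back(domain)      # undo only the domain that just broke
            return done, domain    # later domains are never touched
        done.append(domain)
    return done, None
```

With domains such as regions, availability zones, or groups of routers, a bad change takes out one domain instead of all of them.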

Management access

The outages that still haunt me and gave me my “Ops PTSD” are those where we lost management access during the outage. This is what happened to Rogers, and something similar happened during one of Facebook’s largest outages. It’s a real nightmare scenario: you can’t troubleshoot or remediate the issue because you’ve lost the ability to do anything to the device, often even the ability to just shut it down or otherwise take it out of rotation. Until you restore some form of management access to the device, you’re stuck.

This problem affects folks running physical hardware more than those running on, say, AWS. So, if you run your own data centers, never compromise on out-of-band network access without thinking through the worst case. Make sure you have OOB access and that it is isolated from your in-band network. It's not always easy, but if you don't have this, your outage time can go from minutes to many hours (see the Rogers and Facebook examples). The feeling of helplessness when you know what the issue is and it would take only one command to fix it is awful.
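
One way to catch a hidden dependency before an outage does is to write down what each management component relies on and check whether the OOB path ever leads back to production. The sketch below assumes you maintain such a dependency map yourself. The component names are made up for illustration and do not describe any real network.

```python
from typing import Dict, List, Set


def transitive_deps(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return everything `start` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also protects against cycles
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


def oob_shares_fate(graph: Dict[str, List[str]], oob: str, production: str) -> bool:
    """True if the out-of-band path would go down together with production."""
    return production in transitive_deps(graph, oob)


# Hypothetical dependency map: each key depends on the components in its list.
deps = {
    "oob-console-server": ["oob-switch"],
    "oob-switch": ["mgmt-uplink"],
    "mgmt-uplink": ["prod-core"],  # the hidden dependency we want to find
    "prod-core": [],
}

if __name__ == "__main__":
    print(oob_shares_fate(deps, "oob-console-server", "prod-core"))
```

A check like this is only as good as the map behind it, so it complements an actual test where you take the in-band path down in a maintenance window and confirm you can still reach the console.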

Rollback scenarios

Another valuable tool for reducing the impact of an outage is the ability to roll back quickly, which limits its duration. You made the change, you quickly see things are bad, and you simply roll back your change. For this to work, though, you need management access (see above).

In the early days, with Cisco IOS, we’d do it like this:

# reload in 5
# conf t
# < do your change >
# end
# reload cancel

This scheduled a reload five minutes out before you touched anything. Because the change lived only in the running configuration and was never saved, a reload would bring the device back up with the old startup configuration. If you still had access after the change, you would execute reload cancel to call off the reload, since it was no longer necessary. If you had locked yourself out, the reload undid the change for you. Juniper later introduced a more elegant solution with "commit confirmed," which auto-rolls back changes if not confirmed within a set time frame.

[edit]
user@host# commit confirmed 
commit confirmed will be automatically rolled back in 10 minutes unless confirmed
commit complete
#commit confirmed will be rolled back in 10 minutes
[edit]
# < On-call blows up! >
user@host# rollback 1
user@host# commit

If you work on a software development team, look at using blue-green deployments. However you make changes or deploy new versions of your software, it’s essential to be able to roll back quickly to the last known-good version. Knowing that you can quickly roll back is great for confidence. Obviously, this depends on our earlier topics: “be the first to know” and management access.
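
The "commit confirmed" idea is not tied to routers. Below is a minimal sketch of the same pattern in Python: apply a change, start a timer, and undo the change automatically unless someone confirms in time. The class name `ConfirmedChange` and its `apply`/`undo` callbacks are hypothetical, and a real deployment tool would need durable state rather than an in-process timer.

```python
import threading
from typing import Callable, Optional


class ConfirmedChange:
    """Apply a change now; undo it automatically unless confirm() is called in time."""

    def __init__(self, apply: Callable[[], None], undo: Callable[[], None], timeout_s: float):
        self._undo = undo
        self._lock = threading.Lock()
        self._settled = False   # True once the change is confirmed or rolled back
        self.rolled_back = False
        apply()
        self._timer = threading.Timer(timeout_s, self._on_timeout)
        self._timer.daemon = True
        self._timer.start()

    def _on_timeout(self) -> None:
        with self._lock:
            if self._settled:
                return
            self._settled = True
            self.rolled_back = True
        self._undo()

    def confirm(self) -> bool:
        """Keep the change. Returns False if the automatic rollback already happened."""
        with self._lock:
            if self._settled:
                return not self.rolled_back
            self._settled = True
        self._timer.cancel()
        return True

    def wait(self, timeout: Optional[float] = None) -> None:
        """Block until the timer has fired or been cancelled (handy in scripts)."""
        self._timer.join(timeout)


if __name__ == "__main__":
    config = {"routing_policy": "old"}
    change = ConfirmedChange(
        apply=lambda: config.update(routing_policy="new"),
        undo=lambda: config.update(routing_policy="old"),
        timeout_s=1.0,
    )
    # Nobody confirms (say, we lost access), so the change is undone for us.
    change.wait()
    print(config, change.rolled_back)
```

The important design choice is that the safe outcome is the default: doing nothing, or being unable to do anything, leads back to the last known-good state.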

Team culture - Don’t be a hero

Last but not least, we have to talk about team culture. One of the good things about moving from the traditional Ops model to DevOps (you build it, you own it) is that you no longer throw a new version of the software over the wall and ask your ops team to deploy it sight unseen.

Inevitably, at some point, you ship a broken version and your Ops person (this was me) has no clue what changed or how to troubleshoot it. This is extremely stressful and, frankly, unfair.

Nowadays, most development teams own and deploy their own changes and have modern deployment pipelines to do so. But even then, the software is complex, and not everyone on the team understands all components. So, when there’s an outage, it should be easy to bring in the rest of the team. Don’t be shy about calling your team members, even in the middle of the night. In a healthy organization, you wouldn't want to wake up the next morning and hear that there was a 4-hour outage while a teammate struggled to troubleshoot a bug you introduced. So have a WhatsApp group, or whatever tool you use, just for your team: your own out-of-band channel for emergency help.

Outages are overwhelming. You get a ton of alerts that all need to be acknowledged, and you get questions from support teams, customers, senior management, etc. And then you’re also expected to troubleshoot. No one person can do this. You need a few folks to troubleshoot, manage the alerts, and coordinate the whole process, including communication. So, don’t be a hero. Call in your teammates; together you’ll be able to deal with this. Remember, one team, one dream.

Wrap up

Outages will inevitably occur. Let's acknowledge this reality and take proactive steps to limit their impact and duration.

Be mindful of the knee-jerk management response: “We need more change management process!” Unless you’re a real YOLO shop, this is rarely the answer.

Instead, use the “be the first to know” principle, ensure access to your gear at all times, and have tools ready to swiftly roll back changes to a last-known-good state. Last but not least, embrace the opportunity to learn with your team through retrospectives and emerge even stronger. With a healthy team dynamic, outages can surprisingly become valuable bonding experiences. By being prepared and working together, you'll turn outages into mere blips on the radar.