Postmortem: 2021 Holiday Outage

Posted On: 2022-01-10

By Mark

As promised in the update to the previous post, today's post is a mini postmortem for this site's recent outage and technical issues. The bulk of the issues were avoidable, so I hope that by sharing what I've learned from the experience, I can help others avoid similar problems of their own.

Watch Your CAA Records

The primary issue at the heart of the outage was a misconfigured Certificate Authority Authorization (CAA) record. When set correctly, a CAA record tells Certificate Authorities (CAs) which of them have permission to issue new certificates for that particular domain. If an unauthorized CA receives a request for a new certificate, it is obligated to reject or ignore the request. This is useful, as it makes it more difficult (though not impossible) for an attacker to acquire a valid certificate and thereby impersonate a website. When misconfigured, however, it can make it quite difficult to get any certificate at all.
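To make that concrete, here's a minimal sketch of how to inspect a domain's CAA records. It assumes the third-party dnspython package (my own setup differs; a plain `dig example.com CAA` works just as well):

    import dns.resolver  # third-party: pip install dnspython

    def print_caa_records(domain):
        """Print the CAA records for a domain, if any."""
        try:
            answers = dns.resolver.resolve(domain, "CAA")
        except dns.resolver.NoAnswer:
            # No CAA records at this exact name (a full check would
            # also climb to parent domains, per RFC 8659).
            print(f"{domain} has no CAA records of its own")
            return
        for record in answers:
            # Prints entries like: 0 issue "letsencrypt.org"
            print(record.to_text())

    print_caa_records("example.com")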

My mistake pertaining to the misconfigured CAA record was four-fold:

  1. When I first configured the CAA record, I assumed the organization I requested the certificate from would appear under its own name in the record. In reality, I needed to check their documentation: in this case, the CAA record had to permit a different issuer than the organization I was working with directly.
  2. After configuring the CAA record, I didn't test it. In hindsight this was incredibly foolish: I could have easily issued a certificate request for some unused sub-domain, and by doing so caught my mistake long before it became a problem (see the sketch after this list).
  3. The time between (mis)configuring the CAA record and the actual outage was more than half a year. By that point, I'd forgotten the change - forgotten that I'd left it untested - and thus assumed that I was doing something wrong with my certificate requests.
  4. I took too long to seek external help. I spent more than a week chasing the wrong problem, telling myself "these things take time, just check it tomorrow and maybe it will be working." Once I finally reached out to support, they were immediately able to identify what was wrong and set me on the right path to fixing it.
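For illustration, here's roughly the test that mistake 2 describes: a sketch (again assuming dnspython, and simplifying the full RFC 8659 matching rules, which also handle parent domains and record parameters) that checks whether a given CA appears in a domain's "issue" records. The CA name below is a placeholder - the correct value to check is whatever your CA's documentation says belongs in the record:

    import dns.resolver  # third-party: pip install dnspython

    def ca_may_issue(domain, ca_domain):
        """Check whether ca_domain appears in domain's CAA "issue" tags."""
        try:
            answers = dns.resolver.resolve(domain, "CAA")
        except dns.resolver.NoAnswer:
            return True  # no CAA records here: issuance isn't restricted
        issuers = [r.value.decode() for r in answers
                   if r.tag.decode() == "issue"]
        return ca_domain in issuers

    # Had I run something like this back then, it would have printed False.
    print(ca_may_issue("example.com", "letsencrypt.org"))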

Lessons from CAA Mistakes

In hindsight, it's easy to see those mistakes, and how to avoid them. Reading the (correct) documentation, testing my changes, and minimizing the time between implementing a change and actually using it are all fundamental practices for software development - and no doubt server management as well. If I'd "just done it right" in the first place, there simply wouldn't have been an outage*.

The last mistake - taking too long to ask support for help - provides perhaps the most valuable lesson for me personally. Sure, I may be able to try out several dozen potential solutions in the time it takes me to write a message clearly articulating my problem - but there are a lot of problems that others simply know more about than I do, and taking the time to ask may well save me a lot of time and headache (especially if I'm stuck chasing the wrong problem).

Surprise Deployment Failures

Beyond the CAA issues, there was also the problem that my previous post, despite being authored and uploaded in time for its release on December 27th, wasn't available until after I'd sorted out the certificate troubles. There were two important reasons why this happened:

  1. As a part of my deployment process, I run smoke tests to verify that the site is functioning correctly. This has saved me several times in the past, as it allowed me to catch and remedy misbehaving code before it took down the site*. While the site's certificate was expired, these smoke tests (understandably) failed during certificate validation - and thus the new post didn't actually get deployed (see the sketch after this list).
  2. For some reason I still haven't ascertained, I never received notification of the failed deployment: as far as I knew, everything was fine. Since I had other things on my mind at the time (i.e., the certificate issues), I simply assumed the post was live, and didn't manually check to make sure.
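For clarity, here's a minimal sketch of the kind of smoke test involved, using the requests package (an assumption on my part; my actual tests check more than a single URL). The relevant behavior is that requests validates certificates by default, so an expired certificate fails the test before the deployment proceeds:

    import requests

    def smoke_test(url):
        """Return True if url responds successfully over HTTPS."""
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # treat HTTP 4xx/5xx as failures
            return True
        except requests.exceptions.SSLError:
            print(f"certificate validation failed for {url}")
            return False
        except requests.exceptions.RequestException as exc:
            print(f"smoke test failed for {url}: {exc}")
            return False

    if not smoke_test("https://example.com"):
        raise SystemExit("smoke tests failed; aborting deployment")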

The lesson here is simple: when it comes to automated processes, one should avoid a "no news is good news" approach. To address this, I've since adjusted my deployment notifications so that I get success messages as well - and all that remains is to develop the habit of watching for those confirmations (so that, should the automated messages fail unexpectedly again, their absence will make me curious enough to manually check things out).
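As a sketch of what that adjustment amounts to (the webhook endpoint below is a placeholder, not my real setup), the key change is simply reporting success as well as failure:

    import requests

    WEBHOOK_URL = "https://hooks.example.com/deploy"  # hypothetical endpoint

    def notify(status, detail=""):
        """Report deployment status - success and failure alike."""
        requests.post(WEBHOOK_URL,
                      json={"status": status, "detail": detail},
                      timeout=10)

    notify("success", "new post deployed")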

One thing that's worth mentioning is that the first "problem" is actually not an issue at all. The smoke tests exist to catch system failures early, and prevent them from being silently deployed to the site. In the event of certificate issues, I want the smoke tests to fail: the certificate's state is just as important as the code's stability. I even have a process in place that allows me to manually override such failures (one which I could have used for that deployment - if only I had been aware that it had failed). So, while the smoke tests are technically the reason why the post didn't reach you on time, I have no intention of changing them.

Conclusion

This wraps up my retrospective for this site's December 2021 outage. At the heart of everything was a configuration error, but a combination of failing to properly test things up front, unfortunate timing for the outage, and delays in asking for assistance meant that a problem which could have been wrapped up within a day stretched on for almost two weeks.