Site Status Update: Intermittent Outages

Posted On: 2023-03-01

By Mark

If you've tried to visit this site recently, you may have noticed some pages failing to load (including the home page itself). First, I want to apologize for the inconvenience. Second, I want to take a brief moment to explain what's going on - and why, even though I've fixed the immediate issue, future outages are still possible.

What Went Wrong

At regular intervals (monthly, right now), I add blog posts to my site. I've streamlined the process over the years, cutting down on deployment time and increasing the reliability of those updates. Central to my process is a tool that I don't directly control: my hosting provider offers a "deploy" command, which not only updates the site's code, but also manages the underlying framework versioning, restarts the server process, and handles other housekeeping details. It's very convenient when it works correctly.

Unfortunately, this past month, it failed catastrophically.

Site updates silently failed - repeatedly - and neither my host's support staff nor my own attempts to resolve the problem had much effect (aside from accidentally knocking the site completely offline for a time). Eventually, I was able to "resolve" the issue using an old IT staple: try it on a new machine. So, the site that you're reading this on is all-new (freshly provisioned as of last Saturday). Fortunately, migrating onto the new machine was incredibly simple - in large part thanks to how I built my blog in the first place: unlike most off-the-shelf (CMS-based) blog systems, all I need to do is deploy my code and all the posts are automatically included.

Looking Forward

Unfortunately, although the problem appears to be resolved, I still have no explanation for what actually went wrong. This, combined with the general flakiness of the deploy command (it occasionally failed for no apparent reason, even before the outage), has convinced me that I need a long-term change to my hosting setup. It's still unclear exactly what the new approach will look like, but I can say two things about it with confidence:

  1. I will take full control of every step of deployment: if anything fails, I'll be able to diagnose and resolve it myself.
  2. There will be no visible changes to the site itself: whatever approach I take, it will be running the same code at the same domain name, so everything will look the same as it always has.
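
As a rough illustration of the first point (and not the actual setup, which is still undecided), a deployment script that runs each step explicitly and stops loudly at the first failure might look something like the Python sketch below. The function name `run_steps`, the step names, and the placeholder commands are all hypothetical; real steps would call whatever tools the eventual hosting setup uses.

```python
import subprocess
import sys
from typing import List, Optional, Sequence, Tuple

# A step is a human-readable name plus the command (argument list) to run.
Step = Tuple[str, Sequence[str]]


def run_steps(steps: Sequence[Step]) -> Tuple[List[str], Optional[str]]:
    """Run each step in order, stopping at the first one that fails.

    Returns the names of the steps that completed, plus an error message
    describing the failing step (or None if every step succeeded).
    """
    completed: List[str] = []
    for name, command in steps:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # Report the failure instead of silently carrying on.
            error = (
                f"step '{name}' failed with exit code {result.returncode}: "
                f"{result.stderr.strip()}"
            )
            return completed, error
        completed.append(name)
    return completed, None


# Hypothetical placeholder steps: each one just prints a message using the
# current Python interpreter, standing in for a real deployment command.
PLACEHOLDER_STEPS: List[Step] = [
    ("update code", [sys.executable, "-c", "print('code updated')"]),
    ("restart server", [sys.executable, "-c", "print('server restarted')"]),
]

if __name__ == "__main__":
    done, error = run_steps(PLACEHOLDER_STEPS)
    print("completed steps:", ", ".join(done) or "(none)")
    if error is not None:
        print("deployment stopped:", error, file=sys.stderr)
        sys.exit(1)
```

The point of the sketch is only that each step is visible and its exit code is checked, so a failure names the step that broke rather than passing silently.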

Conclusion

This certainly isn't how I'd planned for my new monthly schedule to go: first I missed the target date for February by more than a week (due to deployment failures), then (as I tried to solve that issue) I knocked the whole site offline, and now March's (slightly early) post has been largely taken over by explaining this mess. Fortunately, my larger project is proceeding well, and changes to how I plan and execute my work are paying off, so even if things outside my control seem to keep going wrong, there's still plenty going right. I hope you'll join me next month - with luck, I'll be able to share some of that good news then.