The System is Down

Neon Genesis Evangelion, © 1995 Gainax

Michael Mechmann

After reflecting on the entirety of the year I’ve spent in quarantine, it’s also (almost) time to commemorate the anniversary of the event that made me leave one of my previous jobs, something which was an absolute nightmare at the time but is actually amusing to read in retrospect. Luckily, one of my colleagues from back then is already doing the work of writing the whole thing up. You can check out Part One which describes some context and the initial disaster; future installments will cover what actually went wrong, which should make it clear why I had to leave. Names have been changed to protect folks, and I will not be giving away who I am in this sordid tale, so for now just read and enjoy. It’s the sort of harrowing story that begins with the alert nobody wants to get: not one, but all of our production systems were down.

As for my thoughts on the whole incident, the Serverpocalypse and subsequent Servergate live on in my memory as a time of almost pure stress and adrenaline, a mad flurry of emails and desperate phone calls punctuating long days dredging up every bit of sysadmin knowledge I’d ever picked up. Day 1 I remember as mostly panic, and Day 2 as the day I personally drank two Dunkin’ Donuts® Boxes of Joe™®© by myself. The rest was a slog of restoring backups and rebuilding virtual production infrastructure from scratch.

Some additional context on the lead-up might help explain how this was possible. For years we had been pushing to move our infrastructure to the cloud, because none of us were really qualified to run an enterprise-grade virtual server farm and we certainly didn’t have the budget to do it right. We were mostly getting away with it, but at the time we were in the middle of building a much more complex application that would have to serve many more people than our other applications (a project I hear never made it to production before being canceled well after I’d departed). And I won’t spoil the rest too much, but we had known we were skating on thin ice for a long time before the disk failure that proved to be the final crack, and our department’s warnings and cries for help went pretty much ignored until everything went pear-shaped, at which point things were, let’s say, not handled in the way I would’ve expected or preferred.

At the time this happened, I had been growing increasingly discontent with the way our genuine needs and concerns were being treated by upper management. Ilona and I were already discussing what my moving on might mean, and she was actually a big proponent of moving to California for more opportunities (plus we both hate NYC winters). This failure event, combined with a surprise change to the aforementioned major project’s timeline that happened very shortly before, is what caused me to begin searching for a new job in earnest, and what indeed brought us to California after all.

It’s fitting that this story’s being put out now: due to other surprising factors outside my control, my current place of employment is closing up shop. I had been really hoping to stay on with them for a long time, but now, just under two years later, I must again begin job searching in earnest. I wouldn’t be where I am if not for this catastrophic event, so it’s one of those strange bad events that lead to good outcomes. But it still fucking sucked, man.

Anyway, hopefully the rest of the story will be out soon and I can gripe semi-anonymously about that too.