My new next-door neighbor is a fellow diver and a food critic. During our conversations, she kept asking what drew me to the DIR system so strongly. As we talked more about our shared interests in food and wine, it became apparent that there's a pattern here.
In my drafts, though, I have an entry about how I won't dive any equipment that behaves differently at 100 feet than it does at 10 feet, because I lose the power of simulation (think of a dive computer that displays different numbers when you have a deco obligation than when you don't). You'll remember that Netflix famously uses their Chaos Monkey system. What is the purpose of all this?
All of us train for contingencies, even when we don't quite use that language. You put logging in your software. You put diagnostics in your hardware. You build breakout boards for microchips so you can easily connect scopes and meters to the leads. The fire warden for your floor takes a mandatory CPR class.
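To make the software version of that concrete, here's a minimal sketch of "logging as contingency" in Python. The payments service, the gateway_charge call, and the failure itself are all hypothetical stand-ins; the point is that the symptoms get recorded before you ever need them.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("payments")

def gateway_charge(order_id, amount):
    # Stand-in for a real payment-gateway call; it fails on purpose
    # here so we can see what the failure actually looks like in logs.
    raise ConnectionError("gateway timed out")

def charge(order_id, amount):
    log.info("charging order %s for $%.2f", order_id, amount)
    try:
        gateway_charge(order_id, amount)
    except ConnectionError:
        # log.exception records the full traceback: the symptoms you
        # want on record before the real fire, not after.
        log.exception("charge failed for order %s", order_id)
        raise

if __name__ == "__main__":
    try:
        charge("A-1001", 19.99)
    except ConnectionError:
        pass  # in production this would page someone
```

None of this is exotic; the discipline is in putting it there before the failure, not after.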
In most of these scenarios, we never question what a failure looks like or feels like, because we've been told that in the case of a fire, specific things will happen. If you see the symptoms, assume there is a fire.
In exploratory scenarios, though, one rarely "knows". I learned this the hard way at work: my model of which alerts and alarms would go off when my services went down rarely matched real life. The signs and symptoms were always there. They just weren't what I expected. As Morpheus would say, "There's a difference between knowing the path and walking the path." When shit goes down, it doesn't go down in isolation. If there was a fire, would my building shake? In that split second, how do I decide whether to use the fire contingency or the earthquake contingency?
I now realize why doctors are put through those long and grueling residencies and internships. When I have a seizure, I do not want my doctor second-guessing the symptoms. They have to "know" what the heck is going on and react almost on instinct.
DIR applies the same principle to diving. Whatever could "theoretically" happen at 100 feet, I need to be able to simulate on my instruments in a swimming pool, on demand, with safety divers around me. If your knob gets twisted shut, or your hose twists for the first time, or you jump off the boat without having your valves turned on, that is not the time to have that "aha" moment. Those seconds between "feeling" and "thinking" are valuable seconds that you're not breathing.
In a services-filled world, nowhere does this apply more than in software. Netflix's Chaos Monkey didn't only teach them to build stateless machines; I'm betting it taught them what a pre-failure looks like. It made them chase false alarm after false alarm, look in random places because that's where the symptoms showed up, until they learned where they SHOULD be looking. This is certainly not some cheesy self-help rule that goes: ignore intuition. It is a supplement to intuition, and also a validation of it. If you've set up the correct constraints, then going through a failure is simply a way to verify them.
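For flavor, here's a toy, safe-to-run sketch of the chaos-monkey idea in Python. This is not Netflix's code: the instance list is hard-coded and the "termination" is just a print, where a real tool would query service discovery and call a cloud API. The point is the rehearsal: pull an instance out at random, then watch which alarms actually fire.

```python
import random

# Hypothetical instance IDs; real chaos tooling would fetch these
# from your cluster's service discovery, not a hard-coded list.
INSTANCES = ["web-1", "web-2", "web-3", "web-4"]

def kill_instance(instance_id):
    # Stand-in for a real termination call (e.g. a cloud API).
    # Printing keeps this sketch safe to run anywhere.
    print(f"terminating {instance_id} -- now watch your dashboards")

def unleash_monkey(probability=0.25):
    """With some probability, terminate one random instance.

    The outage is not the point; the lesson is seeing which alerts
    fire, which don't, and where the symptoms really show up.
    """
    if random.random() < probability:
        kill_instance(random.choice(INSTANCES))
    else:
        print("the monkey sleeps today")

if __name__ == "__main__":
    unleash_monkey()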
Ostriching only increases the danger
(Ostriching is a verb I made up: to ignore the symptoms and pretend there is no problem.)
A common refrain I hear against testing or validating big boasts and claims is that they may not survive the test. A lot depends on why you're approaching the problem. If you're the CEO of Netflix, your interest is far more in ensuring Netflix *works* than in ensuring you *look good in reviews*. In that case, if something is unlikely to work, you really, really want to know about it. Not in the sense of blame, but because you want to fix it. (For a diver, staying alive is incentive enough.)
Do you find yourself asking questions that are answered with eye-rolls, accusations and anger? I meet such people frequently. That is not the sign of a person who has never failed. It is worse: it is the sign of a person who has failed and refuses to learn from it.
Wishing you many more failures
So there you go. I wish you the best of luck in failing under controlled circumstances, whether it be your servers, your network wires, or your gas hoses. It is better to have failed under controlled circumstances than to have failed when you least intend to, and both are better than never to have failed at all. That last one never happens anyway; the only choice is: will you train yourself head-on, or will you be caught by surprise when you least want to be?