Make WordPress return 200 OK instead of 503 (Service Unavailable) when in maintenance mode

When WordPress goes into “maintenance mode”, i.e. when it is updating itself or its plugins/themes, it shows the end user a page telling them it is updating.

While in this mode, it returns HTTP status code 503 (Service Unavailable).

This is a huge problem for people running WordPress containers behind a load balancer. WordPress isn’t broken. It’s not irrecoverable. It’s doing a job. It’s doing the right thing. ELBs don’t know this, and configuring them to treat 503 as a healthy response is a bad idea.

So how do you change that to something better? Like a 200 OK (hey everything is FINE. I’m serving valid content, which happens to be a maintenance page.)

Edit the file:

$WORDPRESS_ROOT/wp-includes/load.php

Find the function wp_maintenance:

/**
 * Die with a maintenance message when conditions are met.
 *
 * Checks for a file in the WordPress root directory named ".maintenance".
 * This file will contain the variable $upgrading, set to the time the file
 * was created. If the file was created less than 10 minutes ago, WordPress
 * enters maintenance mode and displays a message.
 *
 * The default message can be replaced by using a drop-in (maintenance.php in
 * the wp-content directory).
 *
 * @since 3.0.0
 * @access private
 *
 * @global int $upgrading the unix timestamp marking when upgrading WordPress began.
 */
function wp_maintenance() {
        if ( ! file_exists( ABSPATH . '.maintenance' ) || wp_installing() ) {
                return;
        }

        global $upgrading;

        include( ABSPATH . '.maintenance' );
        // If the $upgrading timestamp is older than 10 minutes, don't die.
        if ( ( time() - $upgrading ) >= 600 ) {
                return;
        }

        /**
         * Filters whether to enable maintenance mode.
         *
         * This filter runs before it can be used by plugins. It is designed for
         * non-web runtimes. If this filter returns true, maintenance mode will be
         * active and the request will end. If false, the request will be allowed to
         * continue processing even if maintenance mode should be active.
         *
         * @since 4.6.0
         *
         * @param bool $enable_checks Whether to enable maintenance mode. Default true.
         * @param int  $upgrading     The timestamp set in the .maintenance file.
         */
        if ( ! apply_filters( 'enable_maintenance_mode', true, $upgrading ) ) {
                return;
        }

        if ( file_exists( WP_CONTENT_DIR . '/maintenance.php' ) ) {
                require_once( WP_CONTENT_DIR . '/maintenance.php' );
                die();
        }

        require_once( ABSPATH . WPINC . '/functions.php' );
        wp_load_translations_early();

        header( 'Retry-After: 600' );

        wp_die(
                __( 'Briefly unavailable for scheduled maintenance. Check back in a minute.' ),
                __( 'Maintenance' ),
                // !!!!!! CHANGE THIS to 200
                503
        );
}

And change that last 503 to a 200.

Now when your containers are updating, your load balancer won’t kill them. I suspect this will help all sorts of use cases, even Kubernetes services, without having to tell your health checks to tolerate 503s, a status rightfully reserved for instances that really should be terminated.

Don’t be fooled by plugins that change maintenance pages. Those plugins serve a “user-land” page, i.e. it’s still on top of WordPress. This maintenance mode is “kernel-land”, i.e. in WordPress itself, not a page on top of WordPress. Seeing that hardcoded constant should tell you there’s no plugin hook to intercept it.
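
That said, the core code above does honor one escape hatch that survives WordPress upgrades: the wp-content/maintenance.php drop-in it checks for before falling through to wp_die(). If you’d rather not re-patch load.php after every core update, a minimal sketch of such a drop-in could look like this (the filename and location come from the code above; the markup is purely illustrative):

<?php
// wp-content/maintenance.php: loaded by wp_maintenance() instead of the hardcoded wp_die().
// Because this drop-in runs first and WordPress die()s right after it, we control the status code.
http_response_code( 200 );               // tell the load balancer the container is healthy
header( 'Content-Type: text/html; charset=utf-8' );
header( 'Retry-After: 600' );            // still hint to well-behaved clients to come back later
?>
<!DOCTYPE html>
<html>
<head><title>Maintenance</title></head>
<body>
  <h1>Briefly unavailable for scheduled maintenance.</h1>
  <p>Check back in a minute.</p>
</body>
</html>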

New Solids of Constant Width

When everyone’s special, no one is

You’ve probably seen this video before.

Aside from a 10-second gimmick, what benefit do these objects hold for us non-mathematicians?

Why we use Wheels and Balls

Spheres are nature’s preferred shape. When we need something to roll on, we think of wheels and balls.

We have tried other shapes every so often with entertaining results:

Spheres will be spheres…

For all the benefits of a sphere, it has one huge drawback: its symmetry.

Every time I look away

Spheres make everyone special. Whether you want to play marbles, croquet, fetch or billiards, they will serve you faithfully. And when you don’t want them rolling everywhere, they’ll still find their way under your feet at midnight.

When everyone’s special, no one is.


Can I have a sphere… but only for me?

The best things in life make us feel special compared with everyone else. What if we had objects that behaved like balls in specific cases, but did not act like balls in all the others?

Such objects are called solids of constant width (in two dimensions, shapes of constant width).

We recognize a few shapes in the animation above. The sphere, obviously, but also the Meissner tetrahedron (looks like a pyramid) and a revolved Reuleaux triangle (looks like an ice-cream cone).

Here are a few others:

This time, no sphere at all: a Meissner tetrahedron, and a revolved Reuleaux pentagon and heptagon. The book doesn’t know the difference.

But these objects are boring — algorithmic and predictable. Is there a way we can get exotic shapes? I recently found a paper titled On Curves and Surfaces of Constant Width by Howard L. Resnikoff, which presents a few never-before-seen shapes that are not “revolved polygons.” I rendered these as 3D-printable models, and this is how they behave:

This time, there are two exotic-looking blobs and a revolved Reuleaux pentagon. The book spins exactly the same as before.

Semantic Identity for You

Semantically, each solid is considered “identical for the purpose of rotating a book.” What this means is for that particular purpose, all objects might as well be the same.

This is a fundamental concept we are deeply in tune with at Polyverse. A semantically identical text editor could be anything that, under the load of editing text, you couldn’t tell apart from the one you use today.

Now You Are More Special Than Everyone Else.

Polymorphism for Everyone

For every purpose other than spinning a book on top of them, such as rolling helter-skelter, the physical morphology of our objects differs from a sphere and they quickly come to a halt. Moreover, once word spreads that your rolling objects can’t be used for anything but spinning books in animated GIFs, it deters attempts to steal them.

Consider a Polymorphic Text Editor that, under the wrong load, such as impersonating a kernel device driver, simply comes to a halt. This unusual behavior at once Detects that something is amiss, Defends against misuse, and Deters people from trying. It simply isn’t shaped like a kernel device driver. It’s useless.


Trust is a Decision

A guide on “How to Trust”

On the eve of Re:Inforce, I wanted to share my thoughts on how to trust, elaborating upon something I mentioned in a recent podcast. This post is as applicable socially as it is technologically.

All of us struggle with trust to some extent. It is easy to know what you don’t trust. A person with serial criminal charges. A company with consistently bad behavior. It is far harder to proactively trust, because “this person, this entity has not done anything bad, at least none that I know of” doesn’t carry quite enough strength.

Intuitive attempts to improve trust don’t always work

Have you put your trust in people and they let you down? Did you sign all your binaries, yet someone still managed to inject a backdoor, or leaked private keys? Did you give your developers freedom, only to feel embarrassed on launch day?

We tend to have a number of natural reactions after an untrustworthy event:

  1. Increase deterrence: More code reviews. More code coverage. Double the signatures on every binary. More static analysis. More compliance rules. Withholding rewards. Suspicion. Paranoia. If they won’t listen to me the easy way, I’ll make them do it the hard way. Unfortunately, this only punishes everyone who didn’t mess up, but in no way deters those who did.
  2. “Do” something: This is the equivalent of searching for your keys on the street where the light is, rather than the pitch-black garage where they were lost. Can we use some OOP design pattern? Blockchain might help — it has that “crypto” in it. Maybe using the lessons Netflix learned playing videos around the world might help. Chaos Monkey? Simian Army? That new scanner tool uses AI now. And so on. This is a perfectly normal reaction. When we feel helpless, it can be very comforting just to have some sort of agency.
  3. Make it somebody else’s problem: Maybe the problem was that one programmer. What if we open-sourced it? If more eyeballs see it, it will be better. How about publishing our binary on a blockchain? Then the group can build “consensus” about it. Consensus will make it a better binary. The problem is, as George Bernard Shaw famously said, “Do not do unto others as you may have them do unto you. They may not share the same tastes.”

In a way, our minds are attempting to reconcile the sunk-cost fallacy (having lost $100 on a project with no results, maybe if I invest only $10 more, I’ll get a $200 payoff) with the law of averages (if ten of my past projects failed due to chance, the next is bound to succeed; I can only fail so many times.)

Trust is not an attribute or property; nor is it a lack thereof

The biggest fallacy with trust is to mistake it for an attribute or property.

The software went through 20 code reviews, has 80% code coverage, and was signed three times. Seems fairly trustworthy to me!

My bootloader is overly complex and does 800 steps, documented in an 8,000-page architecture diagram. Must be trustworthy.

This message came through a Blockchain. Five thousand people copied it and verified its hash. How can it be false, if 5,000 people verified the hash?

Sarcasm aside, there is something good about attributes. There’s some value in having code reviews, code coverage and signatures. A trustworthy bootloader is definitely going to be a bit more than starting at address 0x0000 and executing whatever it finds. There’s definitely value in 5,000 people coming together and cooperating.

So where is the line between progress and parody? It is the mistaken belief that trust arises after you’ve done all the right things.

Trust is not an outcome

It is not so much that entities are untrustworthy, but that we misplace our trust.

Trust does not arise naturally or organically because certain things happened.

You wouldn’t go through life buying every car before you settled on one you liked. You may not even want to try every single car without some idea of the features that you want. Nor would you eat cardboard because you don’t personally know of anyone having been killed by ingesting it. You definitely wouldn’t ingest poison oak because 5,000 people agreed that it is indeed, poison oak.

This is all just information. Having more of it, without direction, doesn’t materially change anything. Five thousand people agreeing on whether something is poison oak has value, if you had previously decided whether or not poison oak was safe to eat. A car is easy to settle on, if you knew that there exists no better car for what you had previously decided you wanted out of it.

The way we intuitively reason and are taught about trust is in the wrong direction. We are taught to trust that individual because they are a Buddhist monk. We are never taught to figure out what outcomes we desire, and ask what about a Buddhist monk makes them trustworthy.

The used-car-salesman stereotype will always go along the lines of, “You can trust me because I sign all my binaries and statically analyze them.”

The irony is: I do trust them. It is not so much that entities are untrustworthy, but that we misplace our trust. I would trust a cat to hunt mice. If I had a mouse farm, I would trust a cat to continue to hunt my mice.

A tip on how to effectively parse information: rephrase anyone’s trustworthiness pitch to start with your desired outcome. “Since you, my customer, want the outcome of non-backdoored machines, I promise to insure you for $1 million, against your total data risk of $500K. I accomplish this outcome by signing binaries, which is where the highest risk of backdooring comes from.”

Trust is a decision

Trust is the decision you make to demand outcomes.

The decision of what side-effects you want/can tolerate has to come first. The assurance that you/your employees/vendors/contractors are doing no less than necessary, and no more than sufficient comes right after.

First comes the decision to seek certain outcomes. Second come the verifications that are necessary and sufficient to ensure the outcomes.

Trust is the decision you make to demand outcomes.

Making the decision sets everyone (except malicious actors) free

Trust as a decision scares the crap out of malicious and sleazy actors, while empowering those who defend your interests.

Remember how I wrote about Security as a function of asymmetry? Everything worth having in life is about asymmetry of incentives.

Making the decision to trust works in much the same way — unfairly rewarding the good and unfairly punishing the bad.

Here’s a common scenario I hear from my banking and AI friends. My banking friends would love nothing more than to have certain complex analytics problems solved. My AI friends would love nothing more than to solve those problems. The challenge? Data regulations such as Europe’s General Data Protection Regulation (GDPR), which is seen as a burden. (Personal note: I’m extremely pro-GDPR; I don’t think it’s an unnecessary burden. Open to changing my mind.)

As a consumer how do you feel about your bank or physician sharing data with a third-party company? I feel very uncomfortable. What can they do to make us feel good? Prove that they are HIPAA compliant? Prove that they have security controls? Prove that they are deadly serious? I suppose these are good steps in the positive direction, but what, for example, do I know about HIPAA compliance and its applicability?

Let’s twist the argument around and look for side-effects we want instead. If I deposit a dollar in my bank, then at some arbitrary day in the future, any day of my choosing, I MUST be able to withdraw that dollar. That is what I want. That is the outcome I desire.

I couldn’t give two entries in a blockchain on how the bank ensures this happens. I don’t want to know. I don’t care. What I do care about is what the bank has to lose if they do not give me at least a dollar back. Let’s say the bank has to pay me $10 every time they screw up my dollar. I’d be pretty satisfied with that incentive structure. I would diversify my $100 across 10 banks, knowing that they are all competing NOT to lose $100 on my $10.

See how this has two immediate effects. Some banks that may be complaining about regulation will start desperately wishing that regulation was all that was required of them, rather than direct, personal, material loss. The banks that ARE trustworthy, however, will gladly sign said contracts and win business. If they want to share my data with an AI company, they’re going to make damn sure that data is protected and, frankly, at that point I don’t even care if my data gets lost or stolen. Because I’m assured of getting either $10 or $100 back.

Trust as a decision scares the crap out of malicious and sleazy actors, while empowering those who defend your interests.

So here’s my hopeful guidance for everyone jaded with trust issues: decide when to trust. Ask what the side-effects of the existence of a vendor, technology, methodology or action are. Decide to trust; don’t get bullied into it.

Can we just cut to Infrastructure-As-Declarative-Code?

The missing link between Declarative Infra and Infra-as-Code

Everyone who is allured by the promise of Declarative Infrastructure, “declare what you want, and don’t worry how it happens”, eventually seems to end up at half-baked verbose clumsy templating.

Nobody looks at 50 YAML files (with templates, even) and reads a crisp declaration: this is a secure, throttled, HA WordPress instance! It’s so clear and obvious!

A long-overdue reaction to verbose templating is the new allure of Infrastructure-as-Code. Pulumi and the AWS CDK offer beautiful, compact, clean, parameterized abstractions (functions, objects, etc.)

There is discomfort from the Declarative camp, though: was the promise all wrong? Was the implementation wrong? What gives? We know we should want it… but the imperative definitions really are readable.

I’m here to tell you the world is offering us a false choice — always has. What if I told you we could have “Declarative Systems” and “Infrastructure-As-Code” all in one? Let me explain…

We’re confusing “Static Configuration” with “Declarative Programming”

I’m going to go after (allegedly) the “world’s first system that allows you to declaratively provision infrastructure”.

Kubernetes YAML is not and never was Declarative Programming! It was a magic trick you bought into because of all the grandiosity that came with the sales pitch.

Kubernetes is a Statically Configured system. What DOS’s .ini files were in the ’80s and /etc/*.conf files were in the ’90s, YAML is for Kubernetes. When we kubectl apply, we are writing a bunch of strings to a KV store with pomp and grandiosity that made us believe we were doing “Declarative” something-something. Don’t hate me just yet, because if we accept this, we can build a world that is oh so much more beautiful!

Infrastructure doesn’t have an Assembly Language

Writing a “Higher Level Language” on top of Kubernetes’ alleged “Assembly Language” makes about as much sense as writing C using regular expressions.

Even if Kubernetes were the kernel, YAML is NOT the Assembly Language, because it is missing The Language. Most charitably, the Kubernetes resource model would be “Registers” for a language that doesn’t exist; and they’re really just data constants, not even registers.

You know you can write a regular expression library in your favorite programming language — Javascript, C#, Lua, Elm, Ballerina, whatever. Can you write your favorite programming language in Regular Expressions?

Now compare Assembly Language to Java, Javascript, C#, C++, Elm, Go, Rust, etc. You can write any of these in any of the other ones.

That’s the difference — Assembly Language is not a “Lesser Language”, it is a “Lower Level Language”. It can do everything the others can do — no more, no less.

Writing a “Higher Level Language” on top of Kubernetes’ alleged “Assembly Language” makes about as much sense as writing Java using regular expressions.

This is the essence of why templating looks awkward, and Infra-As-Code looks better, but… feels like sacrificing the very promise that attracted you to Declarative systems in the first place.

What you want is a Declarative System + The Language => Declarative Programming!

Declarative Programming is neither new, nor do you need to have planet-scale problems for it to be useful. If you were tempted by Declarative Infra for the promise of describing your complete apps in a repeatable, portable, and all-in-one-place style, what you wanted was: purity/idempotence, parameters, and closures.

You were promised a Declarative Assembly Language, but you were given Data Registers.

Imperative programming gives you better abstractions than templating, but it still doesn’t understand them — you are still expressing HOW you want Data generated, not WHAT you want as an overall Goal.

There is a better way! There is a Declarative Way of writing programs, where Predicates are what would be loops and conditionals in imperative programs. Relations are what would be functions in imperative programs. Facts are what would be data in imperative programs. Assertions are what would be tests in imperative programs. If you haven’t already, Play with Prolog.

Declarations are not these dead blobs of static YAML! They’re living, breathing encapsulations of understanding and knowledge that make the platform grow and get better!

A Declaratively Programmed Infrastructure Platform

I wrote a mini-spec on twitter and I want to get past the theory and describe what a proper Declarative App/Infra Platform would look like (whether in/on/above/under Kubernetes or not.)

Desires

Desires are what you express. The system doesn’t change them, touch them, modify them, mutate them, etc. Strictly no admission webhook mutation api aggregation server controller whatsoever on a Desire. You can express anything in the world you want. You can Desire to be paid 1 million dollars. You can Desire a Ferrari.

In today’s world, what you kubectl apply would always be a Desire. It represents nothing more, nothing less, and nobody gets to change it, modify it or argue that you shouldn’t want what you want.

Facts

Facts are things the system simply knows to be true. A Fact would be what the /status sub-resource is today, or a Node. No more weird ugly resource/sub-resource bullshit that everyone is modifying and awkwardly versioning with ever-complex merge algorithms. Just straight-up “Fact”. I “Desire a Pod”, so the system gave me the “Fact of a Pod”. Two independent first-class entities.

Predicates

Predicates are “knowledge” the system has as it learns. A Predicate applies constraints, adds information, removes information, but in a DECLARATIVE way.

For example, today if you knew all Scottish Sheep are Blue, you can’t declare that knowledge to any Declarative Infrastructure. You have to enter EACH sheep in Scotland as being blue, either through templating or through a “real language”. Not only is one verbose and the other clumsy (and neither declarative); the real travesty is that valuable knowledge is lost that others cannot benefit from. Nobody else can read the code and infer that you really wanted to state that “all Scottish sheep are blue.”

In Declarative Programming, though, you can have it both ways! Enter Predicates. You can at once know whether any particular sheep is blue, and also remember the general rule that all Scottish Sheep are Blue so others can benefit from it. You don’t give one up for the other.

More concretely, let me write you some predicates that the system would understand first-class. These aren’t some clumsy custom controllers responding to a Shared Informer using a ClientSet with CodeGen’d Golang API packages and iteratively using level-triggered reactions to set annotations. Ugh, so many words! No, no: these are fundamental Declarations in the Assembly Language of the system! This is the system’s Language, not static configuration.

  • All Pods not fronted by a service -> Ready to Garbage collect

(Note that I didn’t write all those Pods should be marked as Ready to Garbage Collect. They ARE ready to garbage collect — you don’t tell the system WHAT to do, simply what you KNOW.)

  • All Services with HTTP endpoint -> Invalid
  • All Pods with older-than-30 days SSL keys -> Invalid

Once declared, the predicates teach and improve the system itself. The system understands them. Not in operators or controllers or some third-party templating place.

Relations

Finally, what makes all this magic work is Relations. Relations are what teach the system how to accomplish Desires. A Relation teaches it how to take a Desire and relate it to more Desires and/or Facts.

The system simply breaks down high-level Desires, until all Desires have a Relation to a Fact. Then it manifests those Facts. It can also tell you when a Desire cannot be met and why. Is a Predicate getting in the way? Is a resource depleted?

Let me illustrate:

  • Relation: Desire: Exposed Service(ip) -> Fact: IP Open to Internet
  • Relation: Desire: Secure Service(domain) -> Desire: Secure Proxy(domain, ip_of_insecure_service, domain_cert) + Desire: Insecure Service + Desire: Domain Cert(domain)
  • Relation: Desire: Secure Proxy(domain, ip, cert) -> Fact: Nginx Proxy(domain,ip,cert)
  • Relation: Desire: Insecure Service -> Fact: Insecure Service
  • Relation: Desire: Domain Cert(domain) -> Fact: Domain Cert(domain)

Now, I collapsed a few steps here, but it doesn’t matter. You get the point of how to Relate an exposed, Secure Service to an Insecure Service and a Cert Generator.
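
To make that breakdown concrete, here is a toy sketch in Go (the types and relation names are hypothetical, lifted from the list above; a real system would be far richer) of a resolver that walks Relations until every Desire bottoms out in Facts, or reports exactly which Desire it cannot satisfy:

package resolver

import "fmt"

// Desire and Fact are hypothetical names taken straight from the prose above.
type Desire string
type Fact string

// A Relation teaches the system how one Desire breaks down into
// sub-Desires and/or Facts.
type Relation struct {
	Desires []Desire
	Facts   []Fact
}

var relations = map[Desire]Relation{
	"ExposedService":  {Facts: []Fact{"IPOpenToInternet"}},
	"SecureService":   {Desires: []Desire{"SecureProxy", "InsecureService", "DomainCert"}},
	"SecureProxy":     {Facts: []Fact{"NginxProxy"}},
	"InsecureService": {Facts: []Fact{"InsecureService"}},
	"DomainCert":      {Facts: []Fact{"DomainCert"}},
}

// Resolve recursively breaks a Desire down until everything is a Fact,
// or reports the Desire it has no Relation for (the "WHY it cannot be met"
// answer described above).
func Resolve(d Desire) ([]Fact, error) {
	rel, ok := relations[d]
	if !ok {
		return nil, fmt.Errorf("no Relation for Desire %q", d)
	}
	facts := append([]Fact{}, rel.Facts...)
	for _, sub := range rel.Desires {
		subFacts, err := Resolve(sub)
		if err != nil {
			return nil, err
		}
		facts = append(facts, subFacts...)
	}
	return facts, nil
}

Resolve("SecureService") bottoms out in the NginxProxy, InsecureService and DomainCert Facts; Resolve of an unknown Desire returns the missing Relation, which is exactly the answer an app developer should be handed back.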

This is all we need. Let’s bring it all together!

A System with 1000+ Microservices at Planet Scale described As Code 100% Declaratively

Let’s look at how the cluster comes together:

  • Vendors/Sysadmins/Whoever provide Relations that convert Desires to Facts. This is below the app interface you are supposed to see. If a Desire doesn’t become Fact, the Vendor/Admin/Operator is paged. A clear boundary.
  • InfoSec/Business/Compliance inject Predicates. They define what constraints they want. What can and cannot be true.
  • App developers provide a Desire. That is all they do. They learn whether it can or cannot be met, and if it cannot be met, WHY it cannot be met: a Predicate got in the way; a Relation is missing, i.e. the system doesn’t know how to break a certain Desire down into Facts; or a Desire->Fact Relation errored, and here are the error details.

Now we have a Declaratively Programmed Infrastructure — where knowledge is not lost, we get full programming, we get full declarative-ness, and we get even more things:

  1. We can ask WHY something exists. Who desired it? Are they allowed to desire it?
  2. We can assert things and make overarching statements. We can say “All Pods in EU always store data in EU”. We can simply MAKE such a powerful statement.
  3. We can add constraints and not scan/validate/audit. If a constraint exists, it simply makes certain Facts untrue, i.e. they are unmanifestable.
  4. We can compose higher Desires over lower Desires.

If I were sold THIS Declarative system with the lowest Assembly Language today, I would buy it.

Whether I’ll ever get to use one, I don’t know — but I sure hope so.

Amazon is strip-mining Consultants, not Open Source

Software Development is a great business model; Software-installation never was

If you’ve been following the latest incarnation of OSS licensing drama, you’ll see hints of the BLA (Bad Linux Advocacy) fervor of the late ’90s/early 2000s, the OpenStack drama of the early 2010s, and the Kubernetes drama being played out today.

Like my therapist says about all my terrible relationships: if everyone turns out to be toxic in the same manner, ask what was common across all of them.

Microsoft and Windows may have lost. Linux may have won. But if you weren’t around 15 years ago: Linux didn’t create the jobs everyone thought it would. Supposedly, if Linux was capable but needed babysitting, there’d be a cottage industry of Apache configurators, NIS/LDAP maintainers, NFS managers, etc. If I had to name the category, it’d be the DBA. Those jobs never materialized, and Red Hat became the scapegoat for making it too simple.

On the other hand, open source jobs and businesses are thriving. Building software is making all the money in the world. Employers are complaining about being ghosted — candidates who sign offers but never show up, candidates who tell you they quit by disappearing, and so on. More people are being paid more money to develop software in the open. Your GitHub account is your resume.

So what exactly is the deal with community licenses, Amazon hosting open source “without giving back”, and so on? Are we really sore that Amazon isn’t giving back, or are we sore that they figured out the trap? I’ll come to this in a bit when we discuss OpenStack, the Amazon-killer from 2014.

The Consultant Trap

Money is the cheapest resource you have.

Do you remember OpenStack? It was absolutely horrible — I’ve run it.

OpenStack had to accomplish two mutually exclusive tasks at the same time.

If you were a >$1 billion revenue company with 100 dedicated people, it had to be easy for you to run. If you were anything less than that, it had to be difficult.

If it were trivial, it would destroy Amazon, but nobody would pay YOU. If it were too difficult but desirable, Amazon had all the capital to do it better than you.

It had to allow IBM, Rackspace, VMWare, etc. to take customers away from Amazon, while all at once, preventing a thousand others from doing the same.

OpenStack upstream actively fought attempts to make it simple, usable or obvious, because that was going to be why you’d want a vendor. If you dare me, I’ll hunt down pull requests, bug reports and issues to prove the point.

It’s a trap, because it is a consciously maintained gap (like a trapdoor function in cryptography), where the key looks like you. If you fill in that gap, the thing works. If you are not present, it doesn’t. That’s how you distribute a free product, but ensure you will be paid.

Why Amazon?

Can you imagine Microsoft or Google just simplifying something because you want it? Something that didn’t come from their billion-dollar research divisions based on running things at “mission-critical enterprise” scale for the former, or “galaxy scale” for the latter? They will pontificate around the Parthenon in togas in order to figure out the true meaning of what “a database” is, and in what higher-order manner the world should see the data.

It’d also become a “promotion vehicle” to tack on pet-projects (that’s what really killed the Zune, Windows Phone, Windows 8 and so on.) It’d have to have vision, strategy, synergy, dynamism, a new programming paradigm for someone’s promotion, a new configuration language for another promotion, etc.

Amazon scares the crap out of me. They’ll spin up a team of ten people, give them all the money and ask them to solve the damn problem in the simplest, stupidest way possible; promotions, titles, orgs be damned. Bezos has heard that a customer wants a cache. Make the cache happen or go to hell. You can see them doing that comfortably.

So how do you beat Amazon?

Make it EASY!

Have you learned nothing? I have two examples: WordPress and Red Hat. Amazon would never host WordPress. What can they do except make it painful, complicated, and hidden behind that terrible UX of theirs? What WordPress doesn’t do is require a painful number of consultants to get started. Despite being easy, the WordPress hosting business is thriving!

Red Hat is similar. It works. Oracle copied the code base and slashed the price in half; it got nowhere. If you really worry about complex kernel patches, you’re going to pay for Red Hat’s in-house expertise. If you don’t need them, half the price doesn’t do much.

Every project should learn from this. Vendors are salivating over the opportunities in Kubernetes, Service Meshes, Storage Plugins, Network plugins, and that will be their downfall. Ironically, if all of this were trivial to run, they would still get paid to host/manage it, and perhaps by more people. Amazon gets in a bind: They can’t “manage” a service that doesn’t need management.

If your business model is maintaining that perfect TrapDoor, you’re going to be strip-mined. License be damned.

The best way to make something toxic for Amazon is to make it so goddamn trivially consumable, that the AWS console and AWS CLI feel like terrible things to ever have to deal with.

On the other hand, making open source easy, simple, consumable and useful will continue to find those who will pay for hosting and management. You will continue to get paid for new feature writing.

Amazon will trip over themselves making it look uglier and stupider with their VPCs and Subnets and IAMs and CloudFormations and what not. That is how you bring Amazon down.

Semantic versioning has never versioned Semantics!

If you changed the sort() implementation from MergeSort to QuickSort, do you up the major version?

My rants against SemVer are famous. Here’s a meta-rant: Semantic Versioning is broken at a higher level than what it does. It doesn’t version semantics (meaning, behavior, effect, outcome), but rather versions binding signatures (basically any linking, loading, IDE-calling, etc. won’t error when binding.)

The semantic meaning of semantic versioning is itself gaslighting. At best, it should be called SigVer (Signature Versioning).

It doesn’t version “interfaces”, which may have a more concrete runtime contract expectation.

For example, consider the difference between the two:

// V1.0.0
function sort(arr int*, len int) {
    // Do sort here
}

Now, suppose we realize people are passing in nil pointers, so we add a null check.

// V1.0.0
function sort(arr int*, len int) {
    if (NULL == arr) {
        // error, panic, change signature?
    }
    // Do sort here
}

This is interface versioning. The agreed-upon contract is now changed. How is this communicated to a caller? Major version? Minor Version? Patch Version?

If you don’t think about this, then you could get away with patch version, because the binding is compatible.

The problem is far deeper than merely changing an interface contract. What happens when we change the one thing Semantic Versioning promises to version: Semantics?

What if in one version, the function sort(arr struct*, len int, cf compareFunc*); is implemented using a MergeSort, and in the later version, using a QuickSort? All sorts of unit tests and data justify this change — this is reliable, safe, and you can stand behind it.

For a moment ignore the debates around side-effects and performance. Look at this consumer code:

struct home {
    city char*
    state char*
}

function SortAlphaByState(homes (struct home *), len int) {
    // Let's sort by city first
    sort(homes, len, compareByCity)
    // Now sort by state (preserving the city ordering within each state)
    sort(homes, len, compareByState)
}

MergeSort is a stable sort. QuickSort is usually not.
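
You can reproduce the same hazard with a real standard library. Go, for example, ships both sort.Slice (no stability guarantee) and sort.SliceStable; here is a minimal sketch mirroring the consumer code above (the home values are made up for illustration):

package main

import (
	"fmt"
	"sort"
)

type home struct{ city, state string }

func main() {
	homes := []home{
		{"Tacoma", "WA"}, {"Austin", "TX"}, {"Seattle", "WA"}, {"Dallas", "TX"},
	}

	// Stable version: sorting by state preserves the earlier city ordering,
	// so homes within each state stay alphabetical by city.
	sort.SliceStable(homes, func(i, j int) bool { return homes[i].city < homes[j].city })
	sort.SliceStable(homes, func(i, j int) bool { return homes[i].state < homes[j].state })
	fmt.Println("stable:  ", homes)

	// Unstable version: same call signature, same "SemVer-compatible" API,
	// but the relative order of homes within the same state is no longer guaranteed.
	sort.Slice(homes, func(i, j int) bool { return homes[i].city < homes[j].city })
	sort.Slice(homes, func(i, j int) bool { return homes[i].state < homes[j].state })
	fmt.Println("unstable:", homes)
}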

The Semantics have changed. This has major downstream implications to business logic, airlines, space craft and all the mission-critical stuff we want to defend.

Do you bump up Major version? Minor Version? Patch Version?

The Docless Manifesto

“undocumented” code is the new ideal

Documentation is necessary. It’s essential. It’s valuable. For all its purported benefits though, even towards the end of 2018, it remains relegated to a stringly-typed escape-hatch to English, at best.

Let’s dissect that problem statement below, but feel free to skip the section if you nodded in agreement.

The Problem: Documentation is a Stringly-Typed escape hatch to English (at best)

String-Typing: A performance-review-safe way of saying “no typing”.

String-typing is the result of two constraints placed on developers:

  1. We must be strongly typed; we’re not hipsters — we enforce strong opinions on consumers of our program/library/platform.
  2. We must be extensible; we don’t know if we have the right opinion necessarily, and are afraid of getting abstracted out or replaced if we end up being wrong.

So how do you have an opinion that can never be wrong? You “enforce” the most generic opinion humanly possible:

// C/C++
void *compare(void *a, void *b);
// Java - but hey, at least primitive types are ruled out!
static Object compare(Object a, Object b)  // on class UnnecessaryNounHolderForFunction
// Golang
func compare(a interface{}, b interface{}) (interface{}, error)

There it is! Now we can never have a wrong opinion, but gosh darnit, we enforce the opinion we have — don’t you be sending stuff the programming language doesn’t support! We will not stand for barbarism.

How does anyone know what those functions do?

We can solve this problem in one of two ways:

  1. You can just read the source code and stay up-to-date on what it does at any given iteration (and make the code easy to read/comprehend.)
  2. Or we can double-down on how we got here in the first place: If we had a highly opinionated but extensible way to document what it does, we could document anything: Welcome to Code Comment Documentation!

The strong opinion: You can add a string of bytes (non-bytes will NOT be tolerated; we’re not barbaric), and that string can represent anything (extensible). It is left to the user to interpret or trust anything said there — ASCII, Unicode, Roman script, Devanagari, English, French, Deutsch, whatever — it’s extensible, but at least it’s all bytes. We won’t parse it, store it, maintain it, or make any sense of it in the compiler — we will contractually ignore it.

Why don’t we program in English in the first place?

Past the low-hanging fruit about English (or any natural language) being specific to a region, culture, or people, there is a much deeper reason English is not used to write computer programs: English is semantically ambiguous.

It is remarkably difficult to get two people, let alone a machine, to make sense of an English statement in the exact same way. Which is why we build programming languages that define strict semantics for every statement.

So how does a program that is hard to make sense of, even when written in a deterministic programming language, become easier to comprehend when rewritten as a code comment in a language that fundamentally cannot convey accurate meaning? It doesn’t, of course. But much like CPR, which works far less often than most people think, it is a form of therapy for the developer: they feel like they did everything they could.

A second problem, one that annoys me personally, is the blatant violation of DRY (Don’t Repeat Yourself). You write code twice: once in your programming language, and once again in English (at best, because that’s all I read). I usually have to read both if I am to make any reliable bet on your system. Only one of those two is understood unambiguously by the computer, by you and by me. The other is not understood by the computer at all, and you and I may not even interpret it the same way.

We can do better!

We have the benefit of hindsight. We have seen various constructs across programming languages. We can and should do better!

  1. Unit tests are spec and examples — they demonstrate where inputs come from, how to mock them, and what side effects should or should not happen. A unit test becomes, at once, a usage document and an assurance document.
  2. Better type-systems allow us to capture what we accept.
  3. Function/Parameter annotations can provide enumerable information to a caller.
  4. Function Clauses and Pattern Matching allow the same function to be better expressed in smaller chunks with limited context.
  5. Guards can merge input-validation and corresponding “what-I-don’t-accept” code comments into one — a machine-parsable enforceable spec + a beautiful document to read.
  6. Documentation as a first-class part of language grammar (that homoiconic influence from Lisp again.) In at least one system in Polyverse, we made code comments enumerable through an interface (i.e. object.GetUsage() returns usage info, object.GetDefaultValue() returns a default value, etc.) This means we can query a running compiled system for its documentation. We no longer have version-sync problems across code, or fear of loss of documentation. If all we have left is an executable, we can ask it to enumerate its own public structs, methods, fields, types, etc. and recreate a full doc for that executable. (See the sketch below.)
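
As a concrete illustration of point 6, here is a minimal Go sketch. The Documented interface and its methods are hypothetical stand-ins (the Polyverse system itself isn’t shown here), but they capture the idea of documentation that lives inside the compiled program and can be queried at runtime:

package docs

// Documented is a hypothetical interface: anything implementing it carries
// its own documentation into the compiled binary.
type Documented interface {
	GetUsage() string        // human-readable usage, queryable at runtime
	GetDefaultValue() string // default value, queryable at runtime
}

type RetryPolicy struct {
	MaxAttempts int
}

func (r RetryPolicy) GetUsage() string {
	return "MaxAttempts: how many times to retry a failed request (1-10)"
}

func (r RetryPolicy) GetDefaultValue() string {
	return "MaxAttempts = 3"
}

// Describe walks any set of Documented values and recreates their docs from
// the running system; there is no separate README to fall out of sync.
func Describe(items ...Documented) []string {
	out := make([]string, 0, len(items))
	for _, d := range items {
		out = append(out, d.GetUsage()+" (default: "+d.GetDefaultValue()+")")
	}
	return out
}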

All of this information is available to us. We intentionally lose it. It annoys me every day that I read someone else’s code. What else did they want to say? What else did they want to tell me? How is it that I force them to only communicate with cave paintings, and blame them for not explaining it better?

This led me to write:

The Docless Manifesto

A Docless Architecture strives to remove arbitrary duplicate string-based disjoint meta-programming in favor of a system that is self-describing — aka, it strives to not require documentation, in the conventional way we understand it.

The Docless Manifesto has two primary directives: Convey Intent and Never Lose Information.

Convey Intent

The question to ask when writing any function, program or configuration, is whether you are conveying intent, more so than getting the job done. If you convey intent, others can finish the job. If you only do the job, others can’t necessarily extract intent. Obviously, when possible, accomplish both, but when impossible, bias towards conveying intent above getting the job done.

Use the following tools whenever available, and bias towards building the tool when given a choice.

  1. Exploit Conventions: If you want someone to run make, give them a Makefile. If you want someone to run npm, give them a package.json. Convey the obvious using decades of well-known shorthand. Use it to every advantage possible.
  2. Exploit Types, Pattern-Matching, Guards, Immutable types: If you don’t want negative numbers, and your language supports it, accept unsigned values. If you won’t modify inputs in your function, ask for a first-class immutable type. If you might not return a value, return a Maybe<Type>. Ask your language designers for more primitives to express intent (or a mechanism to add your own.)
  3. Make Discovery a requirement: Capture information in language-preserved constructs. Have a Documented interface that might support methods GetUsage, GetDefaultValue, etc. The more information is in-language, the more you can build tools around it. More so, you ensure it is available in the program, not alongside the program. It is available at development time, compile time, runtime, debug time, etc. (Yes, imagine your debugger looking for implementations of Documented and pulling usage docs for you in real time from a running executable.)
  4. Create new constructs when necessary: A “type” is more than simply a container or a struct or a class. It is information. If you want a non-nullable object, create a new type or alias. Create enumerations. Create functions. Breaking code out is much less about cyclomatic complexity and more about conveying intent.
    A great example of a created-construct is the Builder Pattern. At the fancy upper-end you can provide a full decision-tree purely through discoverable and compile-time checked interfaces.
  5. Convey Usage through Spec Unit-Tests, not a Readme: The best libraries I’ve ever consumed gave me unit tests for example usage (or non-usage). This leads to three benefits at once. First, I know the examples are perfectly in sync, without guessing, if the tests pass (discoverability), and the developer knows where they must update examples as well (more discoverability). Second, it allows me to test MY assumptions by extending those tests. The third and most important benefit of all: your Spec is in Code, and you are ALWAYS meeting your Spec. There is no state where you have a Spec that is separate from, and out of sync with, the Code.
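
Here is a minimal Go sketch of usage-as-spec, with a deliberately tiny, hypothetical Clamp function standing in for a real library. Because the example is compiled and executed on every test run, it can never silently drift from the code it documents:

package clamp

import "testing"

// Clamp limits v to the inclusive range [lo, hi]. In a real project it would
// live in its own file; it is inlined here to keep the sketch self-contained.
func Clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// TestClampUsage doubles as the usage document: it shows accepted inputs,
// edge behavior, and exactly what callers can rely on. If Clamp's semantics
// ever change, this "documentation" fails loudly.
func TestClampUsage(t *testing.T) {
	if got := Clamp(12, 5, 10); got != 10 {
		t.Fatalf("Clamp(12, 5, 10) = %d, want 10", got)
	}
	if got := Clamp(5, 5, 10); got != 5 {
		t.Fatalf("Clamp(5, 5, 10) = %d, want 5 (lower bound is inclusive)", got)
	}
}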

Never Lose Information

This is the defining commandment of my Manifesto!

Violating this commandment borders on the criminal in my eyes. Building tools that violate this commandment will annoy me.

The single largest problem with programming isn’t that we don’t have information. We have a LOT of it. The problem is that we knowingly lose it because we don’t need it immediately. Then we react by adding other clumsy methods around this fundamentally flawed decision.

My current favorite annoyance is code-generation. Personally I’d prefer a 2-pass compiler where code is generated at compile-time, but language providers don’t like it. Fine, I’ll live with it. What bugs me is the loss of the semantically rich spec in the process. It probably contained intent I would benefit from as a consumer.

Consider a function that only operates on integers between 5 and 10.

When information is captured correctly, you can generate beautiful documents from it, run static analysis, do fuzz testing, unambiguously read it, and just do a lot more — from ONE definition.

typedef FuncDomain where FuncDomain in int and 5 <= FuncDomain <= 10.
function CalculateStuff(input: FuncDomain) output {
}
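
Most mainstream languages can’t express that typedef directly, but you can approximate it without losing the information. A hedged Go sketch (the names mirror the pseudocode above; the constructor is the only way to obtain the type, so the 5-to-10 rule is written exactly once and survives into the compiled program):

package funcdomain

import "fmt"

// FuncDomain is an int that is guaranteed, by construction, to lie in [5, 10].
type FuncDomain struct{ v int }

// New is the sole way to obtain a FuncDomain; the range check lives here and
// nowhere else, and the error message carries the contract to every caller.
func New(v int) (FuncDomain, error) {
	if v < 5 || v > 10 {
		return FuncDomain{}, fmt.Errorf("FuncDomain must be in [5, 10], got %d", v)
	}
	return FuncDomain{v}, nil
}

// CalculateStuff never re-validates: the type has already proved the input is in range.
func CalculateStuff(input FuncDomain) int {
	return input.v * 2 // placeholder computation
}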

When information is still captured for you (in English) and for the programming language (through input validation), you lose a lot of the above, and you take on the additional burden of unit-testing the limits.

// Make sure input is between 5 and 10
function CalculateStuff(input: int) output {
    if (input < 5 OR input >= 10) {
        // throw error, panic, exit, crash, blame caller,
        // shame them for not reading documentation
    }
}

This borders on criminal. What happens when you pass it a 5? It’s accepted; pass it a 10 and it’s rejected. Aha, we have a half-open interval. Documentation bug? Make sure input is between 5 and 10, 5 inclusive, 10 exclusive. Or [5, 10). We just wrote a fairly accurate program. It’s a pity it wasn’t THE program.

We can certainly blame the programmer for not writing the input validation code twice. Or we may blame the consumer for having trusted the doc blindly (which seems to be like the Pirate’s Code… more what you’d call a guideline.) The root problem? We LOST information we shouldn’t have — by crippling the developer! Then we shamed them for good measure.

Finally, the absolute worst, is where you have no input validation but only code comments explaining what you can and cannot do.

I could write books on unforgivable information loss. We do this a lot more than you’d think. If I weren’t on the sidelines, Kubernetes would hit this for me: you begin with strongly-typed structs, code-generated from spec because the language refuses to give you generics or immutable structures, communicated through a stringly-typed API (for extensibility), giving birth to an entire cottage industry of side-channel documentation systems.

In a world where information should be ADDED and ENRICHED as you move up the stack, you face the opposite. You begin with remarkably rich specs and intentions, and you end up sending strings and seeing what happens.

So here’s my call to Docless! Ask yourself two questions before throwing some string of bytes at me and calling it “Documentation”.

  1. Have you taken every effort to convey intent?
  2. Have you missed an opportunity to store/capture/expose information that you already had before duplicating the same thing in a different format?

The SLA-Engineering Paradox

Why outcome-defined projects tend to drive more innovation than recipe-driven projects

In the beginning, there was no Software Engineering. But as teams got distributed across the world, they needed to communicate what they wanted from each other and how they wanted it.

Thus, Software Engineering was born, and it was… okay-ish. Everything ran over budget, over time, and ultimately proved unreliable. Unlike construction, from which SE learned its craft, computers had nothing resembling bricks, which hadn’t changed since pre-history.

But in a small corner of the industry a group of utilitarian product people were building things that had to work now: they knew what they wanted out of them, and they knew what they were willing to pay for them. They created an outcome-driven approach. This approach was simple: they asked for what they wanted, and tested whether they got what they asked for. These outcomes were called SLAs (Service Level Agreements) or SLOs (Service Level Objectives). I’m going to call them SLAs throughout this post.

The larger industry decided that it wanted to be perfect. It wanted to build the best things, not good-enough things. Since commoners couldn’t be trusted to pick and choose the best, it would architect the best, and leave commoners only to execute. It promoted a recipe-driven approach. If the recipe was perfectly designed, perfectly expressed, and perfectly executed, the outcome would, by definition, be the best possible.

The industry would have architects to create perfect recipes, and line cooks who would execute them. All the industry would need to build were tools to help those line cooks. This led to Rational Rose, UML, SOAP, CORBA, XML DTDs, and to some extent SQL, but more painfully Transact-SQL, stored procedures, and so on.

The paradox of software engineering over the past two decades is that, more often than not, people who specified what they wanted without regard to recipes tended to produce better results, and create more breakthroughs and game changers, than those who built the perfect recipes and accepted whatever resulted from them as perfect.

(I can hear you smug functional-programming folks. Cut it out.)

Today I want to explore the reasons behind this paradox. But first… let’s understand where the two camps came from intuitively.

The intuitive process to get the very best

If you want to design/architect the most perfect system in the world, it makes sense to start with a “greedy algorithm”. At each step, a greedy algorithm chooses the absolute very best technology, methodology and implementation the world has to offer. You vet this thing through to the end. Once you are assured you could not have picked a better component, you use it. As any chef will tell you, a high-quality ingredient cooked with gentle seasoning will outshine a mediocre ingredient hidden by spices.

The second round of iteration is reducing component sets with better-together groups. The perfect burger might be more desirable with mediocre potato chips than with the world’s greatest cheesecake. In this iteration you give up the perfection of one component for the betterment of the whole.

In technology terms, this would mean that perhaps a SQL server and web framework from the same vendor go better together, even if the web server is missing a couple of features here and there.

These decisions, while seemingly cumbersome, would only need to be done once and then you were unbeatable. You could not be said to have missed anything, overlooked anything, or given up any opportunity to provide the very best.

What you now end up with, after following this recipe, is the absolute state-of-the-art solution possible. Nobody else can do better. To quote John Hammond, “spared no expense.”

SLA-driven engineering should miss opportunities

The SLA-driven approach of product design defines components only by their attributes — what resources they may consume, what output they must provide, and what side-effects they may have.

The system is then designed based on an assumption of component behaviors. If the system stands up, the components are individually built to meet very-well-understood behaviors.

Simply looking at SLAs it can be determined whether a component is even possible to build. Like a home cook, you taste everything while it is being made, and course-correct. If the overall SLA is 0.2% salt and you’re at 0.1%, you add 0.1% salt in the next step.

The beauty of this is that when SLAs can’t be met, you get the same outcome as recipe-driven design. The teams find the best the world has to offer and tell you what it can do; you can do no better. No harm done. In the best case, however, once they meet your SLA, they have no incentive to go above and beyond. You get mediocre.

This is a utilitarian industrial system. It appears that there is no aesthetic sense, art, or even effort to do anything better. There is no drive to exceed.

SLA-driven engineering should be missing out on all the best and greatest things in the world.

What we expected

The expectation here was obvious to all.

If you told someone to build you a website that works in 20ms, they would do the least possible job to make it happen.

If you gave them the perfect recipe instead, you might have gotten 5ms. And even if you got 25ms or 30ms, as long as they followed your recipe faithfully, you’ve by definition hit theoretical perfection.

A recipe is more normalized; it is compact. Once we capture that it will be a SQL server with password protection, security groups, roles, principals, user groups, onboarding and offboarding processes, everything else follows.

On the other hand, SLA folks are probably paying more and getting less, and what’s worse, they don’t even know by how much. Worse, they are also defining de-normalized SLAs. If they fail to mention something, they don’t get it. Aside from salt, can the chef add or remove arbitrary ingredients? Oh, the horror!

The Paradox

Paradoxically, we observed the complete opposite.

SLA-driven development gave us Actor Models, NoSQL, BigTable, Map-Reduce, AJAX (what “web apps” were called only ten years ago), Mechanical Turk, concurrency, and most recently, Blockchain. Science and tech that was revolutionary, game-changing, elegant, simple, usable, and widely applicable.

Recipe-driven development on the other hand kept burdening us with UML, CORBA, SOAP, XML DTDs, a shadow of Alan Kay’s OOP, Spring, XAML, Symmetric Multiprocessing (SMP), parallelism, etc. More burden. More constraints. More APIs. More prisons.

It was only ten years ago that “an app in your browser” would have led to Java Applets, Flash or Silverlight or some such “native plugin.” Improving ECMAScript to be faster would not be a recipe-driven outcome.

The paradoxical pain is amplified when you consider that recipes didn’t give us provably correct systems, and failed abysmally at empirically correct systems that the SLA camp explicitly checked for. A further irony is that under SLA constraints, provable correctness is valued more — because it makes meeting SLAs easier. The laziness of SLA-driven folks not only seems to help, but is essential!

A lot like social dogma, when recipes don’t produce the best results, we have two instinctive reactions. The first is to dismiss the SLA-solution through some trivialization. The second is to double-down and convince ourselves we didn’t demand enough. We need to demand MORE recipes, more strictness, more perfection.

Understanding the paradox

Let’s try to understand the paradox. It is obvious when we take the time to understand context and circumstances in a social setting where a human is a person with ambitions, emotions, creativity and drive, vs. an ideal setting where “a human” is an emotionless unambitious drone who fills in code in a UML box.

SLAs contextualize your problem

This is the obvious, easy one. If you designed web search without SLAs, you would end up with a complex, distributed SQL server with all sorts of scaffolding to make it work under load. If you first looked at SLAs, you would reach a radically new idea: these are key-value pairs. Why do I need B+ trees and complex file systems?

Similarly, if you had to ask, “What is the best system to do parallel index generation?”, the state of the art at the time was PVM/MPI. Having pluggable Map/Reduce operations was considered so cool that every resume in the mid-2000s had to mention these functions (something so common and trivial today that your web-developer JavaScript kiddie is using it to process streams.) The concept of running a million cheap machines was neither the best nor the state of the art. 🙂 No recipe would have arrived at it.

SLAs contextualize what is NOT a problem

We all carry this implied baggage with us. Going by the examples above, today none of us consider using a million commodity machines to do something as being unusual. However, with no scientific reason whatsoever, it was considered bad and undesirable until Google came along and made it okay.

Again, having SLAs helped Google focus on what is NOT a problem. What is NOT the objective. What is NOT a goal. This is more important than you’d think. The ability to ignore unnecessary baggage is a powerful tool.

Do you need those factories? Do you need that to be a singleton? Does that have to be global? SLAs fight over-engineering.

SLAs encourage system solutions

What? That sounds crazy! It’s true nonetheless.

When programmers were told they could no longer have buffer overflows, they began working on Rust. When developers were told to solve the dependency problem, they flocked to containers. When developers were told to make apps portable, they made Javascript faster.

On the other hand, recipe-driven development, not being forced and held to an outcome, kept going the complex way. Its solution to buffer overflows was more static analysis, more scanning, more compiler warnings, more errors, more interns, more code reviews. Its solution to the dependency problem was more XML-based dependency manifests and all the baggage of semantic versioning. Its solution to portable apps was more virtual machines, yet one more standard, more plugins, etc.

Without SLAs Polyverse wouldn’t exist

How much do I believe in SLA-driven development? Do I have proof that the approach of first having the goal and working backwards works? Polyverse is living proof.

When we saw the Premera hack, if we had asked what the BEST in the world was, we’d have thought of even MORE scanning, and MORE detection, and MORE AI, and MORE neural networks, and maybe some MORE blockchain. Cram in as much as we can.

But when we asked ourselves “If the SLA is to not lose more than ten records a year or bust,” we reached a very different systemic conclusion.

When Equifax happened, if we’d asked ourselves what the state of the art was, we would have done MORE code reviews, MORE patching, FASTER patching, etc.

Instead we asked, “What has to happen to make the attack impossible?” We came up with Polyscripting.

Welcome to the most counterintuitive, yet empirically proven and widely used, conclusion in Software Engineering. Picking the best doesn’t give you the best. Picking a goal and working backwards leads to something better than anything the world has to offer today.

Google proved it. Facebook proved it. Apple proved it. Microsoft proved it. They all believe in it. Define goals. Focus on them. Ignore everything else. Change the world.

You’re thinking about scale all wrong

Scale isn’t about large numbers

To hear modern architects, system designers, consultants and inexperienced (but forgivable) developers talk about scale, you’d think every product and service was built to be the next Twitter or Facebook.

Ironically, almost everything they create to be scalable would crash and burn if that actually happened. Even Google and Amazon aren’t exceptions to this, at least from time to time. I know this because we run the largest build farm on the planet, and I’m exposed to dirty secrets about pretty much every cloud provider out there.

I want to talk about what scalability really means, why it matters and how to get there. Let’s briefly calibrate on how it’s used today.

Recap of pop-culture scalability

When most tech journalists and architects use the word scale, they use it as a noun. They imagine a very large static system that’s like… really really big in some way or another. Everyone throws out numbers like they’re talking about corn candy — hundreds or thousands of machines, millions of processes, billions of “hits” or transactions per second… you get the idea.

If you can quote a stupidly large number, you’re somehow considered important, impregnable even.

Netflix constitutes 37% of US internet traffic at peak hours. Microsoft famously runs “a million” servers. WhatsApp moves a billion messages a day.

These numbers are impressive, no doubt. And it’s precisely because they’re impressive that we think of scale as a noun. “At a million servers,” “a billion transactions” or “20% of peak traffic” become defining characteristics of scale.

Why it’s all wrong

Calling something “scalable” simply because it is very, very, very large is like calling something realtime only because it is really, really fast.

Did you know that nowhere in the definition of “real-time systems” does it say “really, really fast?” Real-time systems are meant to be time-deterministic, i.e., they perform some operation in a predictable amount of time.

Having a system go uncontrollably fast can quite frequently be undesirable. Ever played one of those old DOS games on a modern PC? You know how they run insanely fast and are almost unplayable? That’s an example of a non-realtime system. Just because it runs incredibly fast doesn’t make it useful. Acting with desirable and predictable time characteristics is what would make it a realtime system.

What makes a system realtime is that it works in time that is “real” — a game character’s movements must move in time that is like the real world, the soundtrack of a video must play to match the reality of the video, a rocket’s guidance computer must act in a time that matches the real world. Occasionally a “real time” system might have to execute NO-OPs so that certain actuators are signaled at the “correct time.”
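
A minimal sketch of the difference, assuming a hypothetical update_world() step: the realtime version deliberately waits (the NO-OP mentioned above) so that each tick lands at the correct wall-clock moment, while the non-realtime version is the old DOS game running as fast as the CPU allows.

import time

TICK = 1.0 / 60.0  # target: one simulation step every ~16.7 ms

def update_world():
    pass  # advance the game state by exactly one tick (hypothetical)

def run(realtime=True):
    next_tick = time.monotonic()
    for _ in range(120):  # ~2 seconds of simulated time
        update_world()
        next_tick += TICK
        if realtime:
            # Deliberately idle so the next step happens at the "correct time".
            delay = next_tick - time.monotonic()
            if delay > 0:
                time.sleep(delay)
        # With realtime=False, the loop produces the same results,
        # just at wildly wrong (CPU-bound) speed.

run()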

As with much of computing, scalability is defined by the correctness of a system’s behavior, rather than by its size or speed.

Scale is a verb, not a noun

The biggest misconception about scale is that it is about being “at scale.” There’s no honor, glory, difficulty or challenge in that, trust me. You want to see a 10K node cluster handling 100M hits per second? Pay me the bill, you got it. I’ll even spin it up over a weekend.

The real challenge, if you’ve ever run any service/product for more than a few months, is the verb “to scale.” To scale from 10 nodes to 100 nodes. To scale from 100 transactions to 500 transactions. To scale from 5 shards to 8 shards.

A scalable system isn’t one that launches at some fancy large number and just stupidly sits there. A scalable system is one that scales (the verb), not one that runs at some arbitrary large number (the noun).

What scalability really means

We commonly use Big-O notation to define the behavioral guarantees of an algorithm. If I were to sort n numbers, a quicksort would perform at worst n-squared operations and use n memory units. A realtime sort would add the constraint that it must respond within a known amount of wall-clock time for a given n.

Similarly, a scalable system has a predictable Big-O operational complexity to adapt to a certain scale.

Meaning, if you had to build a system to handle n transactions per second, how much complexity do you predict it would take to set it up?

O(n)? O(n-squared)? O(e^n)?

Not really an easy answer, is it? Sure, we try our best, we question everything, and we often really worry about our choices at scale.

But are we scale-predictable? Are we scale-deterministic? Can we say that “for 10 million transactions a second, it would take the order of 10 million dollars, and NO MORE, because we are built to scale”?

I run into a dozen or so people who talk about large numbers and huge workloads, but very few who can grow with my workload at an incremental, predictable operational cost.

Scalability doesn’t mean a LOT of servers. Anyone can rent a lot of servers and make them work. Scalability doesn’t mean a lot of transactions. Plenty of things will fetch you a lot of transactions.

Scalability is the Big-O measure of cost for getting to that number, and moreover, the predictability of that cost. The cost can be high, but it needs to be known and predictable.
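
To make that concrete, here is a back-of-the-envelope sketch with entirely made-up dollar figures. The scalable system is not the one with the lower number; it is the one whose cost function you can evaluate in advance for any n.

# Hypothetical cost models; the shapes matter, not the values.

def predictable_cost(n_tps):
    # Scalable: cost is O(n). You can quote a price for any n before building it.
    return 0.002 * n_tps  # dollars/month per transaction-per-second (made up)

def jagged_cost(n_tps):
    # "At scale" but not scalable: hidden cliffs you only discover on impact.
    cost = 0.001 * n_tps
    if n_tps > 5_000:
        cost += 250_000      # surprise: the client library has to be replaced
    if n_tps > 50_000:
        cost += 2_000_000    # surprise: the storage layer needs a re-architecture
    return cost

for n in (1_000, 10_000, 100_000):
    print(n, predictable_cost(n), jagged_cost(n))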

Some popular things that “don’t scale”

Hopefully this explains why we say some things “don’t scale.” Let’s take the easiest punching bag: any SQL server. I can run a SQL server easily. One that handles a trillion transactions? Quite easy. With 20 shards? That’s easy too. With 4 hot-standby failovers? Not difficult. Geographically diverse failovers? Piece of cake.

However, the cost of going from the one SQL instance I run up to those things? The complexity cost is this jagged step function.

A lot of unpredictable jagged edges

And I’m only looking at a single dimension. Will the client need to be changed? I don’t know. Will that connection string need special attention? Perhaps.

You see, the difficulty/complexity isn’t in actually launching any of those scenarios. The challenge is in having a predictable cost of going from one scenario to another.

Why should this matter?

I’m advocating for predictable growth in complexity.

Let’s talk about my favorite example: rule-based security systems. Can any rule-based system (IPTables, firewalls, SELinux, AuthZ services) handle 10 million rules? You bet. If you have a statically defined system, architected on blueprints, with every rule carefully predefined, it’s possible to create the rules and use them.

Can you go from 10 rules to 10,000 rules on a smooth slope, paying for complexity only as you need it?


This is hardly ever the case. You might think that I’m advocating for a linear growth in complexity. I’m not. I’m advocating for a predictable growth in complexity. I’d be fine with an exponential curve, if I knew it was exponential.

What makes it unscalable isn’t that the cost is VERY high, nor that it is a step function you could predict. What makes it truly unscalable is that the complexity is abruptly and, worse, unpredictably steppy. You’ll add 10 rules without incident; the 11th causes a conflict that leads to a two-day investigation and debugging session. You might add 100 nodes with ease; add one more past some IP range and you’ll spend weeks with a network tracer looking for the problem.
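
A toy illustration of where that steppiness comes from, using hypothetical CIDR-style rules: every new rule has to be reconciled against every existing one, the check is cheap right up until it isn’t, and the first conflict is what turns a ten-second change into a two-day investigation.

# Rules are just (prefix, action) pairs here; real systems are far messier.
rules = [(f"10.0.{i}.0/24", "ALLOW") for i in range(10)]  # the easy first 10

def conflicts(new_rule, existing):
    prefix, action = new_rule
    # O(n) comparisons per added rule, O(n^2) over the life of the ruleset --
    # and any hit means a human has to decide which rule should win.
    return [r for r in existing if r[0] == prefix and r[1] != action]

eleventh = ("10.0.3.0/24", "DENY")  # overlaps an existing rule with the opposite action
print(conflicts(eleventh, rules))   # -> [('10.0.3.0/24', 'ALLOW')]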

An example a bit closer to home. We’ve been looking for a home for Polyverse’s BigBang system — the world’s largest build farm that powers all the scrambling you get transparently and easily.

As an aside, you’ll notice that Polymorphic Linux is “scalable.” What cost/complexity does it take for n nodes, whether n is 1, 100, 10,000 or 10,000,000? The answer is easily O(n). It is sub-linear in practice, but even in the worst case it is linear. No emergency consultants, system designers or architects are required to rethink or redesign anything. This is an example of what good scalability looks like.

Behind the scenes of that scalability though, is another story. I’ve spoken to nearly every cloud provider on the planet. I may have missed a few here and there, but I bet if you named a vendor, I’ve spoken to them. They all have “scalable systems,” but what they really have are various systems built to different sizes.


Finding clouds/systems/clusters that can just run really, really large loads is easy. Running those loads is also easy. Finding clouds that are predictable in complexity based on a particular load? Even with all the cloud propaganda, that’s a tough one.

Cybersecurity needs more scalable systems, not systems “at scale”

Scalable systems are not about size, numbers or capability. They have a predictable cost in the dimension of size.

Hopefully I’ve explained what scalable really means. In much the same way that you’d measure a system in number of operations, amount of memory, number of transactions, or expected wall-clock time, a scalable system is operationally predictable in terms of size.

It doesn’t have to be cheap or linear. Merely predictable.

Cybersecurity today is desperately in need of solutions that “can scale,” not ones that merely run “at scale.” We need scalable solutions that encourage MORE security by adding MORE money. Not haphazard, arbitrary and surprising step functions.

Threat Models Suck

They’re everything that’s wrong with cybersecurity

The coffee I’m sipping right now could kill me. You think I jest, but I assure you: if you work backwards from “death”, there is a possible precondition involving some very deadly coffee. I just brewed another pot, and I survived it to the end of this post. I love living on the edge and ignoring threats.

In cybersecurity though, we love our threat models. We think they’re smart and clever. Intuitively they make sense; in much the same way that a dictatorship and police state make sense, or nearly all the dystopian science fiction AIs make sense. If we programmed the AI to “keep us safe”, it is going to reach the optimal annealed solution: Remain under curfew, work out, stay isolated, don’t interact, and eat healthy synthetic nutritional supplements.

I’ve hated Threat Models since the day I had the displeasure of building one, a decade ago. The first, and easy, problem with them is that they are product/solution driven; they’re rhetorical. Any credible threat model should have “shrug” as the mitigation for 80% of its threats. When we don’t have a way to react to a threat, we subconsciously consider it non-existent. Nearly all threat models are playing Jeopardy (pun intended).

The second and more subtle problem is they encourage social grandstanding. How do you become a “more serious cybersecurity expert”? By coming up with a crazier threat vector than the last person.

“What if that CIA agent is, in reality, an NSA operative who was placed there by MI6 in order to leak the NOC list to MI6? Have you ever considered that? Now stop isolating that XML deserializer like some kind of pure functional programming evangelist, and let’s do some cybersecurity! Booyah!”

This is why we keep coming up with crazier and crazier tools while overlooking the obvious. I still cringe when someone calls Meltdown and Spectre “timing attacks”. The problem isn’t that the cache works the way it does, or that you can measure access times. The problem is shared state. But that doesn’t sound sexy, and you can’t sell a 50-year-old proven concept. Linus has perhaps the most profound quote in cybersecurity history: security problems are primarily just bugs.

Adding jitter to timers, however, is clever, sexy, complicated, protects jobs, creates new jobs, and gets people promoted. Removing shared state across threads/processes is just a design burden that mitigates any impact (and solves a bunch of other operational problems while it’s at it).

Impact Model

I propose we build Impact Models instead. Impact Models help prioritize investments and help us make common-sense decisions; more importantly, they help us course-correct those decisions by measuring whether the impact is actually mitigated or reduced.

In one of my talks, aimed at helping startups prioritize security investments correctly, I use this slide.

Why are you investing in Cybersecurity anyway? Is it to run cool technology? Is it to do a lot of “Blockchain”? Or is it to reduce/mitigate impact to business?

Just because something is technically a threat doesn’t mean it has an appreciable impact. You’ll notice in the slide above that if I were to lose an encrypted laptop, it’d be incredibly inconvenient, painful and frustrating. However, Polyverse as an entity would suffer very little. How or why I might lose said laptop becomes less of a concern, since I’ve mitigated the impact of losing it.

This applies to our website too. We try to follow best practices. But beyond a certain point, we don’t consider what happens if AWS were to be hacked, and these legendary AWS-hackers were interested in defacing the website of a scrappy little startup above all else. Would it annoy me? You bet. Would it be inconvenient? Sure. Would it get a snarky little headline in the tech press? Absolutely. But would it leak 150 million people’s PII? Not really.

Another benefit of impact modeling is that it can surface potentially non-“cybersecurity” solutions. I usually present this slide, which echoes Linus’s quote.

Focusing on preventing a threat is important, and you should do it for good hygiene. Reducing the impact of that threat breaking through anyway gives you a deeper sense of comfort.

We don’t live our lives based on threat models. We live them based on impact models. You’ll find that they bring a great deal of clarity to your cybersecurity decision-making; they’ll help you prioritize what comes first and what comes next. They’ll equip you to ask the right questions when purchasing and implementing technology. Most of all, they’ll help you get genuine buy-in from your team. Providing concrete data and justification motivates people far more than mandates and compliance.

“My threats are already sorted by impact!”, you say

I knew this would come up. Indeed every threat model does have three columns: Threat Vector, Impact, Mitigation.

Without impact, you wouldn’t be able to pitch the threat seriously. InfoSec teams are nothing if not good at visualizing world-ending scenarios. Much like my coffee’s purported impact of “death”, rather than “mild dehydration”, got you this far.

Threat Model

The problem is, read that Mitigation column and ask yourself what it’s mitigating. Is it mitigating the Threat Vector, or is it mitigating the Impact?

This is not a syntactic difference; it’s a semantic one. Multiple threats can have the same impact. Mitigating the impact can remove all of them. Even if some new threat is announced that would have led to the same impact, you remain unconcerned. Your reaction is, “no change.”
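
A sketch of that pivot, with purely illustrative entries rather than a real model: index the same rows by impact instead of by threat vector, and one impact-level mitigation covers the whole bucket, including threats nobody has enumerated yet.

# Illustrative rows only; a real model would be far larger.
threat_model = [
    {"threat": "stolen laptop",        "impact": "customer PII leaked"},
    {"threat": "phished credential",   "impact": "customer PII leaked"},
    {"threat": "unpatched web server", "impact": "customer PII leaked"},
    {"threat": "defaced website",      "impact": "embarrassing headline"},
]

# Pivot: group threats under the impact they lead to.
impact_model = {}
for row in threat_model:
    impact_model.setdefault(row["impact"], []).append(row["threat"])

# One impact-level mitigation ("the PII store is encrypted and unreadable from
# the web tier") now covers every threat in that bucket at once.
for impact, threats in impact_model.items():
    print(impact, "<-", threats)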

Impact Model

In short, if Equifax had changed the “if” to “when”, they’d have had a much smaller problem to deal with.

Wishing you all a reduced impact.