Amazon is strip-mining Consultants, not Open Source

Software development is a great business model; software installation never was

If you’ve been following the latest incarnation of OSS licensing drama, you’ll see hints of the BLA (Bad Linux Advocacy) fervor of the late ’90s and early 2000s, the OpenStack drama of the early 2010s, and the Kubernetes drama being played out today.

Like my therapist says about all my terrible relationships: if everyone turns out to be toxic in the same way, ask what was common across all of them.

Microsoft and Windows may have lost. Linux may have won. But if you weren’t around 15 years ago, you might not know that Linux never created the jobs everyone thought it would. The theory was that if Linux was capable but needed babysitting, there’d be a cottage industry of Apache configurators, NIS/LDAP maintainers, NFS managers, and so on. If I had to name the category, it’d be the DBA. Those jobs never materialized, and Red Hat became the scapegoat for making it all too simple.

On the other hand, open source jobs and businesses are thriving. Building software is making all the money in the world. Employers are complaining about being ghosted: candidates who sign offers but never show up, candidates who quit by simply disappearing, and so on. More people are being paid more money than ever to develop software in the open. Your GitHub account is your resume.

So what exactly is the deal with community licenses, Amazon hosting open source “without giving back,” and so on? Are we really sore that Amazon isn’t giving back, or are we sore that they figured out the trap? I’ll come back to this in a bit when we discuss OpenStack, the Amazon-killer of 2014.

The Consultant Trap

Money is the cheapest resource you have.

Do you remember OpenStack? It was absolutely horrible — I’ve run it.

OpenStack had to accomplish two mutually exclusive tasks at the same time.

If you were a company with more than a billion dollars in revenue and 100 dedicated people, it had to be easy for you to run. If you were anything less than that, it had to be difficult.

If it were trivial, it would destroy Amazon, but nobody would pay YOU. If it were too difficult but desirable, Amazon had all the capital to do it better than you.

It had to allow IBM, Rackspace, VMware, etc. to take customers away from Amazon while, all at once, preventing a thousand others from doing the same.

OpenStack upstream actively fought attempts to make it simple, usable or obvious, because that was going to be why you’d want a vendor. If you dare me, I’ll hunt down pull requests, bug reports and issues to prove the point.

It’s a trap, because it is a consciously maintained gap (like a trapdoor function in cryptography), where the key looks like you. If you fill in that gap, the thing works. If you are not present, it doesn’t. That’s how you distribute a free product, but ensure you will be paid.

Why Amazon?

Can you imagine Microsoft or Google simplifying something just because you want it? Something that didn’t come out of their billion-dollar research divisions, built for running things at “mission-critical enterprise” for the former, or at “galaxy scale” for the latter? They will pontificate around the Parthenon in togas to figure out the true meaning of what “a database” is, and in what higher-order manner the world should see its data.

It’d also become a “promotion vehicle” for tacking on pet projects (that’s what really killed the Zune, Windows Phone, Windows 8, and so on). It’d have to have vision, strategy, synergy, dynamism, a new programming paradigm for someone’s promotion, a new configuration language for another’s, etc.

Amazon scares the crap out of me. They’ll spin up a team of ten people, give them all the money and ask them to solve the damn problem in the simplest, stupidest way possible; promotions, titles, orgs be damned. Bezos has heard that a customer wants a cache. Make the cache happen or go to hell. You can see them doing that comfortably.

So how do you beat Amazon?

Make it EASY!

Have you learned nothing? I have two examples: WordPress and Red Hat. Amazon would never host WordPress. What can they do except make it painful and complicated, behind that terrible UX of theirs? What WordPress doesn’t do is require a painful number of consultants to get started. Despite being easy, the WordPress hosting business is thriving!

Red Hat is similar. It works. Oracle copied the code base and slashed the price in half; it got nowhere. If you really worry about complex kernel patches, you’re going to pay for Red Hat’s in-house expertise. If you don’t need that expertise, half the price doesn’t do much.

Every project should learn from this. Vendors are salivating over the opportunities in Kubernetes, Service Meshes, Storage Plugins, Network plugins, and that will be their downfall. Ironically, if all of this were trivial to run, they would still get paid to host/manage it, and perhaps by more people. Amazon gets in a bind: They can’t “manage” a service that doesn’t need management.

If your business model is maintaining that perfect trapdoor, you’re going to be strip-mined. License be damned.

The best way to make something toxic for Amazon is to make it so goddamn trivially consumable, that the AWS console and AWS CLI feel like terrible things to ever have to deal with.

On the other hand, making open source easy, simple, consumable and useful will continue to attract people who will pay for hosting and management. You will continue to get paid for writing new features.

Amazon will trip over themselves making it look uglier and stupider with their VPCs and Subnets and IAMs and CloudFormations and what not. That is how you bring Amazon down.

You’re thinking about scale all wrong

Scale isn’t about large numbers

To hear modern architects, system designers, consultants and inexperienced (but forgivable) developers talk about scale, you’d think every product and service was built to be the next Twitter or Facebook.

Ironically, almost everything they create to be scalable would crash and burn if that actually happened. Even Google and Amazon aren’t an exception to this, at least from time to time. I know this because we run the largest build farm on the planet, and I’m exposed to dirty secrets about pretty much every cloud provider out there.

I want to talk about what scalability really means, why it matters and how to get there. Let’s briefly calibrate on how it’s used today.

Recap of pop-culture scalability

When most tech journalists and architects use the word scale, they use it as a noun. They imagine a very large static system that’s like… really, really big in some way or another. Everyone throws out numbers like they’re talking about candy corn: hundreds or thousands of machines, millions of processes, billions of “hits” or transactions per second… you get the idea.

If you can quote a stupidly large number, you’re somehow considered important, impregnable even.

Netflix constitutes 37% of US internet traffic at peak hours. Microsoft famously runs “a million” servers. WhatsApp moves a billion messages a day.

These numbers are impressive, no doubt. And it’s precisely because they’re impressive that we think of scale as a noun. “At a million servers,” “a billion transactions” or “20% of peak traffic” become defining characteristics of scale.

Why it’s all wrong

Calling something “scalable” simply because it is very, very, very large is like calling something realtime only because it is really, really fast.

Did you know that nowhere in the definition of “real-time systems” does it say “really, really fast”? Real-time systems are meant to be time-deterministic, i.e., they perform some operation in a predictable amount of time.

Having a system run uncontrollably fast can quite often be undesirable. Have you ever played one of those old DOS games on a modern PC? You know how they run insanely fast and are almost unplayable? That’s an example of a non-realtime system. Just because it runs incredibly fast doesn’t make it useful. Acting with desirable and predictable time characteristics is what would make it a realtime system.

What makes a system realtime is that it works in time that is “real”: a game character’s movements must happen in time that matches the real world, the soundtrack of a video must play in sync with the video, a rocket’s guidance computer must act in time that matches the real world. Occasionally a “real time” system might even have to execute NO-OPs so that certain actuators are signaled at the “correct time.”
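The DOS-game point can be sketched in a few lines. Here is a minimal fixed-timestep loop (the function names and the 60 Hz tick rate are my own illustration, not from any particular engine): run it on a fast or a slow machine, and each step still lands on the same wall-clock schedule.

```python
import time

TICK = 1.0 / 60  # target: one simulation step every ~16.7 ms, regardless of CPU speed

def simulate_one_step():
    pass  # stand-in for actual game/physics logic (hypothetical)

def run(steps):
    """Fixed-timestep loop: on a faster machine the work finishes sooner,
    and the remainder of each tick is spent sleeping (the NO-OP padding
    mentioned above), so simulated time still tracks wall-clock time."""
    start = time.monotonic()
    for i in range(steps):
        simulate_one_step()
        next_tick = start + (i + 1) * TICK
        remaining = next_tick - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)  # pace to real time: determinism, not raw speed
```

The old DOS games skipped the sleep, so “one tick” meant “as fast as the CPU goes,” which is exactly the non-realtime failure mode.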

As with much of computing, the definition of scalability comes down to the correctness of a system, rather than its size or speed.

Scale is a verb, not a noun

The biggest misconception about scale is that it is about being “at scale.” There’s no honor, glory, difficulty or challenge in that, trust me. You want to see a 10K node cluster handling 100M hits per second? Pay me the bill, you got it. I’ll even spin it up over a weekend.

The real challenge, if you’ve ever run any service/product for more than a few months, is the verb “to scale.” To scale from 10 nodes to 100 nodes. To scale from 100 transactions to 500 transactions. To scale from 5 shards to 8 shards.

A scalable system isn’t one that launches at some fancy large number and then just stupidly sits there. A scalable system is one that scales, the verb; not one that runs at some arbitrary large number, the noun.

What scalability really means

We commonly use Big-O notation to characterize the behavior of an algorithm. If I were to sort n numbers, quicksort would perform at worst n-squared operations, and take on the order of n memory units. A realtime sort would add the constraint that it respond within a known bound of wall-clock time.

Similarly, a scalable system has a predictable Big-O operational complexity to adapt to a certain scale.

Meaning, if you had to build a system to handle n transactions per second, how much complexity do you predict it would take to set it up?

O(n)? O(n-squared)? O(e^n)?

Not really an easy answer, is it? Sure, we try our best, we question everything, and we often really worry about our choices at scale.

But are we scale-predictable? Are we scale-deterministic? Can we say that “for 10 million transactions a second, it would take the order of 10 million dollars, and NO MORE, because we are built to scale”?

I run into a dozen or so people who talk about large numbers and huge workloads, but very few who can grow with my workload at incremental operational cost.

Scalability doesn’t mean a LOT of servers. Anyone can rent a lot of servers and make them work. Scalability doesn’t mean a lot of transactions. Plenty of things will fetch you a lot of transactions.

Scalability is the Big-O measure of cost for getting to that number, and moreover, the predictability of that cost. The cost can be high, but it needs to be known and predictable.
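To make that concrete, here is a toy sketch (all cost models and numbers are invented for illustration) of what different operational-complexity classes might look like, measured in something like engineer-weeks to stand up a system handling n transactions per second:

```python
import math

# Toy cost models (invented numbers): engineer-weeks needed to stand up
# a system handling n transactions per second.

def cost_linear(n):
    # O(n): every unit of scale costs about the same. Maybe expensive,
    # but predictable -- you can budget for it.
    return 0.01 * n

def cost_quadratic(n):
    # O(n^2): grows much faster, yet still predictable if you know the curve.
    return 1e-6 * n ** 2

def cost_exponential(n):
    # O(e^n)-ish: ruinous at scale, but still scale-*predictable* in the
    # sense used here -- at least you know what you're signing up for.
    return math.exp(n / 1e6)

for n in (1_000, 100_000, 1_000_000):
    print(n, cost_linear(n), cost_quadratic(n), round(cost_exponential(n), 2))
```

The point isn’t which curve you get; it’s whether you can name your curve at all before the bill arrives.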

Some popular things that “don’t scale”

Hopefully this explains why we say some things “don’t scale.” Let’s take the easiest punching bag: any SQL server. I can run a SQL server easily. One that handles a trillion transactions? Quite easy. With 20 shards? That’s easy too. With 4 hot-standby failovers? Not difficult. Geographically diverse failovers? Piece of cake.

However, the cost of going from the one SQL instance I run to any of those configurations? That complexity cost is a jagged step function.

(Chart: a lot of unpredictable jagged edges.)

And I’m only looking at a single dimension. Will the client need to be changed? I don’t know. Will that connection string need special attention? Perhaps.

You see, the difficulty/complexity isn’t in actually launching any of those scenarios. The challenge is in having a predictable cost of going from one scenario to a different scenario.
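A toy model makes the shape of the problem visible (every scenario name and number here is invented): running any single scenario is cheap, but the cost lives in the lumpy, hard-to-foresee transitions between them.

```python
# Toy model (all numbers invented): cost in engineer-weeks to move a SQL
# deployment between scenarios. No single scenario is hard to run; the
# jagged, unpredictable cost is in the *transitions*.

transition_cost = {
    ("single", "sharded"):       30,  # resharding: client changes, new connection strings
    ("sharded", "hot-standby"):   8,  # surprisingly cheap
    ("hot-standby", "geo"):      55,  # surprise: cross-region latency rework
}

def cost_of_path(path):
    """Total cost of growing through a sequence of deployment scenarios."""
    return sum(transition_cost[(a, b)] for a, b in zip(path, path[1:]))

print(cost_of_path(["single", "sharded", "hot-standby", "geo"]))  # 93
```

Notice that nothing about the cost of one transition predicts the next one; that unpredictability, not the totals, is what “doesn’t scale.”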

Why should this matter?

I’m advocating for predictable growth in complexity.

Let’s talk about my favorite example: rule-based security systems. Can any rule-based system (iptables, firewalls, SELinux, AuthZ services) handle 10 million rules? You bet. If you have a statically defined system, architected on blueprints with every rule carefully predefined, it’s possible to create the rules and use them.

But can you go from 10 rules to 10,000 rules on a smooth slope, paying for complexity only as you need it?

This is hardly ever the case. You might think that I’m advocating for a linear growth in complexity. I’m not. I’m advocating for a predictable growth in complexity. I’d be fine with an exponential curve, if I knew it was exponential.

What makes it unscalable isn’t that the cost is VERY high, or even that it is a predictable step function. What makes it truly unscalable is that the complexity is abrupt and, worse, unpredictably step-shaped. Sometimes you’ll add 10 rules with no trouble. Then the 11th rule causes a conflict that leads to a two-day debugging investigation! You might add 100 nodes with ease, then one extra node past some IP range has you spending weeks with a network tracer looking for the problem.

An example a bit closer to home: we’ve been looking for a home for Polyverse’s BigBang system, the world’s largest build farm, which powers all the scrambling you get transparently and easily.

As an aside, you’ll notice that Polymorphic Linux is “scalable.” What cost/complexity does it take for n nodes, whether n is 1, 100, 10,000, or 10,000,000? The answer is easily O(n). In practice it is sub-linear, but even in the worst case it is linear. There are no emergency consultants, system designers or architects required to rethink or redesign anything. This is what good scalability looks like.

Behind the scenes of that scalability though, is another story. I’ve spoken to nearly every cloud provider on the planet. I may have missed a few here and there, but I bet if you named a vendor, I’ve spoken to them. They all have “scalable systems,” but what they really have are various systems built to different sizes.

Finding clouds/systems/clusters that can just run really, really large loads is easy. Running those loads is also easy. Finding clouds whose complexity is predictable as a function of load? Even with all the cloud propaganda, that’s a tough one.

Cybersecurity needs more scalable systems, not systems “at scale”

Scalable systems are not about size, numbers or capability. They are about a predictable cost along the dimension of size.

Hopefully I’ve explained what scalable really means. In much the same way that you’d measure a system in number of operations, amount of memory, number of transactions, or expected wall-clock time, a scalable system is operationally predictable in terms of size.

It doesn’t have to be cheap or linear. Merely predictable.

Cybersecurity today is desperately in need of solutions that “can scale,” not ones that merely run “at scale.” We need scalable solutions where adding MORE money buys MORE security. Not haphazard, arbitrary and surprising step functions.