Can we just cut to Infrastructure-As-Declarative-Code?

The missing link between Declarative Infra and Infra-as-Code

Everyone lured by the promise of Declarative Infrastructure, “declare what you want, and don’t worry how it happens,” eventually seems to end up with half-baked, verbose, clumsy templating.

Nobody looks at 50 YAML files (even templated ones) and reads a crisp declaration: this is a secure, throttled, HA WordPress instance! It’s so clear and obvious!

A long-overdue reaction to verbose templating is the new allure of Infrastructure-As-Code. Pulumi and the AWS CDK offer beautiful, compact, clean, parameterized abstractions (functions, objects, etc.).

There is discomfort from the Declarative camp, though — was the promise all wrong? Was the implementation wrong? What gives? We know we should want it… but the imperative definitions really are readable.

I’m here to tell you the world is offering us a false choice — always has. What if I told you we could have “Declarative Systems” and “Infrastructure-As-Code” all in one? Let me explain…

We’re confusing “Static Configuration” with “Declarative Programming”

I’m going to go after (allegedly) the “world’s first system that allows you to declaratively provision infrastructure.”

Kubernetes YAML is not and never was Declarative Programming! It was a magic trick you bought into because of all the grandiosity that came with the sales pitch.

Kubernetes is a Statically Configured system. What DOS’s .ini files were in the ’80s, and /etc/*.conf files were in the ’90s, YAML is for Kubernetes. When we kubectl apply, we are writing a bunch of strings to a KV store with a pomp and grandiosity that made us believe we were doing “Declarative” something something. Don’t hate me just yet — because if we accept this, we can build a world that is oh so much more beautiful!

Infrastructure doesn’t have an Assembly Language


Even if Kubernetes were the kernel, YAML is NOT the Assembly Language, because it is missing The Language. Most charitably, the Kubernetes resource model would be the “Registers” for a language that doesn’t exist; and they’re really just data constants, not even registers.

You know you can write a regular expression library in your favorite programming language — Javascript, C#, Lua, Elm, Ballerina, whatever. Can you write your favorite programming language in Regular Expressions?

Now compare Assembly Language to Java, Javascript, C#, C++, Elm, Go, Rust, etc. You can write any of these in any of the other ones.

That’s the difference — Assembly Language is not a “Lesser Language”, it is a “Lower Level Language”. It can do everything the others can do — no more, no less.

Writing a “Higher Level Language” on top of Kubernetes’ alleged “Assembly Language” makes about as much sense as writing Java using regular expressions.

This is the essence of why templating looks awkward, and Infra-As-Code looks better, but… feels like sacrificing the very promise that attracted you to Declarative systems in the first place.

What you want is a Declarative System + The Language => Declarative Programming!

Declarative Programming is neither new, nor do you need to have planet-scale problems for it to be useful. If you were tempted by Declarative Infra for the promise of describing your complete apps in a repeatable, portable, and all-in-one-place style, what you wanted was: purity/idempotence, parameters, and closures.

You were promised a Declarative Assembly Language, but you were given Data Registers.

Imperative programming gives you better abstractions than templating, but the system still doesn’t understand them — you are still expressing HOW you want Data generated, not WHAT you want as an overall Goal.

There is a better way! There is a Declarative Way of writing programs, where Predicates are what would be loops and conditionals in imperative programs. Relations are what would be functions in imperative programs. Facts are what would be data in imperative programs. Assertions are what would be tests in imperative programs. If you haven’t already, Play with Prolog.
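For a taste of that mapping, here is a toy sketch in Python — Prolog-ish rules expressed as plain functions. All names are invented for illustration; a real logic engine would infer these derivations for you:

```python
# Toy sketch: Facts are data; Predicates are rules that derive new facts;
# Assertions check what the system knows. (Names invented for illustration.)

facts = {("parent", "alice", "bob"), ("parent", "bob", "carol")}

def grandparent(facts):
    """Predicate: parent(X, Y) and parent(Y, Z) => grandparent(X, Z)."""
    return {("grandparent", a, c)
            for (p1, a, b) in facts if p1 == "parent"
            for (p2, b2, c) in facts if p2 == "parent" and b2 == b}

derived = grandparent(facts)
assert ("grandparent", "alice", "carol") in derived  # Assertion: like a test
```

In Prolog proper, the rule would be a single line — `grandparent(X, Z) :- parent(X, Y), parent(Y, Z).` — and the engine, not you, would do the searching.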

Declarations are not these dead blobs of static YAML! They’re living, breathing encapsulations of understanding and knowledge that make the platform grow and get better!

A Declaratively Programmed Infrastructure Platform

I wrote a mini-spec on Twitter, and I want to get past the theory and describe what a proper Declarative App/Infra Platform would look like (whether in/on/above/under Kubernetes or not).


Desires

Desires are what you express. The system doesn’t change them, touch them, modify them, or mutate them. Strictly no admission webhook, mutating API, aggregation server, or controller whatsoever on a Desire. You can express anything in the world you want. You can Desire to be paid 1 million dollars. You can Desire a Ferrari.

In today’s world, what you kubectl apply would always be a Desire. It represents nothing more, nothing less, and nobody gets to change it, modify it, or argue that you shouldn’t want what you want.


Facts

Facts are things the system simply knows to be true. A Fact is what the /status sub-resource, or a Node, is today. No more weird, ugly resource/sub-resource bullshit that everyone is modifying and awkwardly versioning with ever-more-complex merge algorithms. Just a straight-up “Fact.” I “Desire a Pod,” so the system gave me the “Fact of a Pod.” Two independent first-class entities.
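The two-entity split could be modeled as below. This is a hypothetical sketch, not any real API — the names `Desire`, `Fact`, `kind`, `spec`, and `observed` are mine:

```python
# Hypothetical sketch: Desires and Facts as two independent, first-class records.
from dataclasses import dataclass, field

@dataclass(frozen=True)  # frozen: nothing in the system may mutate a Desire
class Desire:
    kind: str
    spec: dict = field(default_factory=dict)

@dataclass(frozen=True)  # a Fact is simply what the system observed to be true
class Fact:
    kind: str
    observed: dict = field(default_factory=dict)

want = Desire("Pod", {"image": "wordpress"})
have = Fact("Pod", {"ip": "10.0.0.7", "phase": "Running"})
# The Desire is never rewritten; the Fact stands on its own beside it.
```

The design choice is the point: no merge algorithms are needed because nothing ever writes back into the Desire.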


Predicates

Predicates are “knowledge” the system gains as it learns. A Predicate applies constraints, adds information, or removes information, but in a DECLARATIVE way.

For example, today if you knew all Scottish Sheep are Blue, you couldn’t declare that knowledge to any Declarative Infrastructure. You’d have to enter EACH sheep in Scotland as being blue, either through templating or through a “real language”. Not only is that verbose, clumsy, and non-declarative, the real travesty is that valuable knowledge is lost that others cannot benefit from. Nobody else can read the code and infer that you really wanted to state that “all Scottish sheep are blue.”

In Declarative Programming, though, you can have it both ways! Enter Predicates. You can at once know whether any particular sheep is blue, and also remember the general rule that all Scottish Sheep are Blue so others can benefit from it. You don’t give one up for the other.
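A minimal sketch of the sheep example (names invented): the rule is stated exactly once, yet it answers every individual question, so the general knowledge is never lost.

```python
# Toy illustration: declare "all Scottish sheep are blue" once, as a rule,
# instead of tagging every sheep individually.
sheep = [{"name": "dolly", "country": "scotland"},
         {"name": "shaun", "country": "england"}]

def is_blue(s):
    """Predicate: sheep(S), scottish(S) => blue(S)."""
    return s["country"] == "scotland"

# The one rule answers any per-sheep question:
assert is_blue(sheep[0]) is True
assert is_blue(sheep[1]) is False
```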

More concretely let me write you some predicates that the system would understand first-class. These aren’t some clumsy custom controllers responding to a Shared Informer using a ClientSet with CodeGen’d Golang API packages and iteratively using level-triggered reactions to set annotations. Ugh so many words! No no, these are fundamental Declarations to the Assembly Language of the system! This is the system’s Language not static configuration.

  • All Pods not fronted by a service -> Ready to Garbage collect

(Note that I didn’t write all those Pods should be marked as Ready to Garbage Collect. They ARE ready to garbage collect — you don’t tell the system WHAT to do, simply what you KNOW.)

  • All Services with HTTP endpoint -> Invalid
  • All Pods with older-than-30 days SSL keys -> Invalid

Once declared, the predicates teach and improve the system itself. The system understands them, not via operators or controllers or some third-party templating layer.


Relations

Finally, what makes all this magic work is Relations. Relations are what teach the system how to accomplish Desires. A Relation teaches it how to take a Desire and relate it to more Desires and/or Facts.

The system simply breaks down high-level Desires, until all Desires have a Relation to a Fact. Then it manifests those Facts. It can also tell you when a Desire cannot be met and why. Is a Predicate getting in the way? Is a resource depleted?

Let me illustrate:

  • Relation: Desire: Exposed Service(ip) -> Fact: IP Open to Internet
  • Relation: Desire: Secure Service(domain) -> Desire: Secure Proxy(domain, ip_of_insecure_service, domain_cert) + Desire: Insecure Service + Desire: Domain Cert(domain)
  • Relation: Desire: Secure Proxy(domain, ip, cert) -> Fact: Nginx Proxy(domain,ip,cert)
  • Relation: Desire: Insecure Service -> Fact: Insecure Service
  • Relation: Desire: Domain Cert(domain) -> Fact: Domain Cert(domain)

Now, I collapsed a few steps here, but it doesn’t matter. You get the point of how to relate an Exposed Secure Service to an Insecure Service and a Cert Generator.
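The breakdown the system performs could be sketched roughly like this. It is a toy resolver: the `relations` table and names are hypothetical, and real Relations would carry parameters such as the domain and cert:

```python
# Hypothetical sketch of the Desire -> Fact breakdown described above.
# A Desire kind with no entry in `relations` is assumed to relate
# directly to a Fact the system can manifest.
relations = {
    "SecureService": ["SecureProxy", "InsecureService", "DomainCert"],
}

def manifest(desire, facts):
    """Recursively break a Desire down until everything is a Fact."""
    for sub in relations.get(desire, []):
        manifest(sub, facts)
    if desire not in relations:          # leaf Desire: manifest it as a Fact
        facts.append(desire)
    return facts

facts = manifest("SecureService", [])
# → ['SecureProxy', 'InsecureService', 'DomainCert']
```

A real system would also report failure causes at each step (a Predicate in the way, a missing Relation), rather than silently recursing.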

This is all we need. Let’s bring it all together!

A System with 1000+ Microservices at Planet Scale described As Code 100% Declaratively

Let’s look at how the cluster comes together:

  • Vendors/Sysadmins/Whomever provide Relations that convert Desires to Facts. This is below the app-interface you are supposed to see. If a Desire doesn’t become Fact, the Vendor/Admin/Operator is paged. A clear boundary.
  • InfoSec/Business/Compliance injects predicates. They define what constraints they want. What can and cannot be true.
  • App-Developers provide a Desire. That is all they do. They get back whether it can or cannot be met. If it cannot be met, they get back WHY — a Predicate got in the way; a Relation is missing (i.e., the system doesn’t know how to break a certain Desire down to Facts); or a Desire->Fact Relation errored, with the error details.

Now we have a Declaratively Programmed Infrastructure — where knowledge is not lost, we get full programming, we get full declarative-ness, and we get even more things:

  1. We can ask WHY something exists. Who desired it? Are they allowed to desire it?
  2. We assert things and make overarching statements. We can say “All Pods in EU always store data in EU”. We can simply MAKE such a powerful statement.
  3. We can add constraints and not scan/validate/audit. If a constraint exists, it simply makes certain Facts untrue, i.e. they are unmanifestable.
  4. We can compose higher Desires over lower Desires.

If I were sold THIS Declarative system with the lowest Assembly Language today, I would buy it.

Whether I’ll ever get to use one, I don’t know — but I sure hope so.

You’re thinking about scale all wrong

Scale isn’t about large numbers

To hear modern architects, system designers, consultants and inexperienced (but forgivable) developers talk about scale, you’d think every product and service was built to be the next Twitter or Facebook.

Ironically, almost everything they create to be scalable would crash and burn if that actually happened. Even Google and Amazon are no exception, at least from time to time. I know this because we run the largest build farm on the planet, and I’m exposed to dirty secrets about pretty much every cloud provider out there.

I want to talk about what scalability really means, why it matters and how to get there. Let’s briefly calibrate on how it’s used today.

Recap of pop-culture scalability

When most tech journalists and architects use the word scale, they use it as a noun. They imagine a very large static system that’s like… really really big in some way or another. Everyone throws out numbers like they’re talking about candy corn — hundreds or thousands of machines, millions of processes, billions of “hits” or transactions per second… you get the idea.

If you can quote a stupidly large number, you’re somehow considered important, impregnable even.

Netflix constitutes 37% of the US internet traffic at peak hours. Microsoft famously runs “a million” servers. Whatsapp moves a billion messages a day.

These numbers are impressive, no doubt. And it’s precisely because they’re impressive that we think of scale as a noun. “At a million servers,” “a billion transactions” or “20% of peak traffic” become defining characteristics of scale.

Why it’s all wrong

Calling something “scalable” simply because it is very, very, very large is like calling something realtime only because it is really, really fast.

Did you know that nowhere in the definition of “real-time systems” does it say “really, really fast?” Real-time systems are meant to be time-deterministic, i.e., they perform some operation in a predictable amount of time.

Having a system go uncontrollably fast can quite frequently be undesirable. You ever played one of those old DOS games on a modern PC? You know how they run insanely fast and are almost unplayable? That’s an example of a non-realtime system. Just because it runs incredibly fast doesn’t make it useful. That it could act with desirable and predictable time characteristics is what would make it a realtime system.

What makes a system realtime is that it works in time that is “real” — a game character must move in time that matches the real world, the soundtrack of a video must play in sync with the reality of the video, a rocket’s guidance computer must act in time that matches the real world. Occasionally a “realtime” system might even have to execute NO-OPs so that certain actuators are signaled at the “correct time.”

As with much of computing, the definition of scalability depends on the correctness of a system, rather than the size or speed of it.

Scale is a verb, not a noun

The biggest misconception about scale is that it is about being “at scale.” There’s no honor, glory, difficulty or challenge in that, trust me. You want to see a 10K node cluster handling 100M hits per second? Pay me the bill, you got it. I’ll even spin it up over a weekend.

The real challenge, if you’ve ever run any service/product for more than a few months, is the verb “to scale.” To scale from 10 nodes to 100 nodes. To scale from 100 transactions to 500 transactions. To scale from 5 shards to 8 shards.

A scalable system isn’t one that launches some fancy large number and just stupidly sits there. A scalable system is one that scales as a verb, not runs at some arbitrary large number as a noun.

What scalability really means

We commonly use Big-O notation to define the correctness of an algorithm’s behavior. If I were to sort n numbers, a quicksort would perform at worst n-squared operations, and it would take n memory units. A realtime sort would add the additional constraint that it must respond within a known bound of wall-clock time.

Similarly, a scalable system has a predictable Big-O operational complexity to adapt to a certain scale.

Meaning, if you had to build a system to handle n transactions per second, how much complexity do you predict it would take to set it up?

O(n)? O(n-squared)? O(e^n)?

Not really an easy answer, is it? Sure, we try our best, and we question everything, and we often really worry about our choices at scale.

But are we scale-predictable? Are we scale-deterministic? Can we say that “for 10 million transactions a second, it would take the order of 10 million dollars, and NO MORE, because we are built to scale”?

I run into a dozen or so people who talk about large numbers and huge workloads. But very few people who can grow with my workload, with incremental operational costs.

Scalability doesn’t mean a LOT of servers. Anyone can rent a lot of servers and make them work. Scalability doesn’t mean a lot of transactions. Plenty of things will fetch you a lot of transactions.

Scalability is the Big-O measure of cost for getting to that number, and moreover, the predictability of that cost. The cost can be high, but it needs to be known and predictable.
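To make that concrete, here is a toy sketch. The cost functions below are made up; the point is only that either shape can be “scalable” in this article’s sense, so long as the shape is known up front:

```python
# Toy illustration: "scalability" as a known cost function of target size n.
# Both shapes are predictable; neither is necessarily cheap.

def linear_cost(n, unit=1.0):
    """O(n): cost grows proportionally with the target size."""
    return unit * n

def exponential_cost(n, base=2.0):
    """O(base^n): expensive, but still predictable if you know the curve."""
    return base ** n

# Either can be budgeted for in advance; a jagged, unknown step function cannot.
assert linear_cost(10) == 10.0
assert exponential_cost(10) == 1024.0
```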

Some popular things that “don’t scale”

Hopefully this explains why we say some things “don’t scale.” Let’s take the easiest punching bag — any SQL server. I can run a SQL server easily. One that handles a trillion transactions? Quite easy. With 20 shards? That’s easy too. With 4 hot-standby failovers? Not difficult. Geographically diverse failovers? Piece of cake.

However, the cost of going from the one SQL instance I run up to those things? The complexity cost is this jagged step function.

A lot of unpredictable jagged edges

And I’m only looking at a single dimension. Will the client need to be changed? I don’t know. Will that connection string need special attention? Perhaps.

You see, the difficulty/complexity isn’t in actually launching any of those scenarios. The challenge is in having a predictable cost of going from one scenario to a different scenario.

Why should this matter?

I’m advocating for predictable growth in complexity.

Let’s talk about my favorite example — rule-based security systems. Does any rule-based system (IPTables, firewalls, SELinux, AuthZ services) handle 10 million rules? You bet. If you have a static defined system that is architected on blueprints with every rule carefully predefined, it’s possible to create the rules and use them.

Can you smoothly go from 10 rules to 10,000 rules on a smooth slope? Paying complexity as you need it?

This is hardly ever the case. You might think that I’m advocating for a linear growth in complexity. I’m not. I’m advocating for a predictable growth in complexity. I’d be fine with an exponential curve, if I knew it was exponential.

What makes it unscalable isn’t that the cost is VERY high, or even that it is a step function. What makes it truly unscalable is that the complexity is both abrupt and, worse, unpredictably steppy. Sometimes you’ll add 10 rules with no trouble. Then an 11th rule causes a conflict that leads to a two-day investigation and debugging session! You might add 100 nodes with ease. Add one extra node past some IP range and you’ll spend weeks with a network tracer looking for the problem.

An example a bit closer to home. We’ve been looking for a home for Polyverse’s BigBang system — the world’s largest build farm that powers all the scrambling you get transparently and easily.

As an aside, you’ll notice that Polymorphic Linux is “scalable.” What cost/complexity does it take for n nodes, whether n is 1, 100, 10,000, or 10,000,000? The answer is easily O(n). It is sub-linear in practice, but even in the worst case it is linear. No emergency consultants, system designers or architects are required to rethink or redesign anything. This is an example of what good scalability looks like.

Behind the scenes of that scalability though, is another story. I’ve spoken to nearly every cloud provider on the planet. I may have missed a few here and there, but I bet if you named a vendor, I’ve spoken to them. They all have “scalable systems,” but what they really have are various systems built to different sizes.

Finding clouds/systems/clusters that can just run really, really large loads is easy. Running those loads is also easy. Finding clouds that are predictable in complexity based on a particular load? Even with all the cloud propaganda, that’s a tough one.

Cybersecurity needs more scalable systems, not systems “at scale”

Scalable systems are not about size, numbers or capability. They have a predictable cost in the dimension of size.

Hopefully I’ve explained what scalable really means. In much the same way that you’d measure a system in number of operations, amount of memory, number of transactions, or expected wall-clock time, a scalable system is operationally predictable in terms of size.

It doesn’t have to be cheap or linear. Merely predictable.

Cybersecurity today is desperately in need of solutions that “can scale,” not ones that merely run “at scale.” We need scalable solutions that encourage MORE security by adding MORE money. Not haphazard, arbitrary and surprising step functions.

Convert between Docker Registry Credentials, K8s Image Pull Secrets, and config.json live

Generating registry secrets for Kubernetes is cumbersome. Extracting credentials or updating a secret is annoying. Generating config.json is painful. But we need to do all of this all the time!

I frequently generate service accounts in our private image hub for various tests. Generating config.json is cumbersome. Injecting it into a Kubernetes cluster as a registry secret (for use as an imagePullSecrets entry on a Pod) is hard. A single-character error can lead to a cryptic ErrImagePull. On a Swarm, it’s worse: the tasks just mysteriously won’t spin up.

It is even more painful when working with customers, when you are behind an email-wall at worst, or a Slack-wall at best. Both are terrible for preserving formatting.

So I whipped up this tri-directional live converter. You can add/edit credentials on the left-most side, and both a config.json and a Kubernetes Secret are generated live. You can save either to a file and inject it using:

kubectl create -f <savedfile.yaml>

However, you may also paste a secret you obtained out of Kubernetes by running:

kubectl get secret <secret> --output=yaml

Or you can edit config.json directly. This is incredibly useful when you have a secret that contains authorizations to, say, 10 registries, but you want to revoke 3 of them.

You can remove those registry entries from config.json, and the Kubernetes YAML will update automatically to reflect that.
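Under the hood, the conversion is mechanical. Here is a rough sketch of its core in Python; the registry name and credentials are placeholders:

```python
# Sketch of the conversion: registry credentials -> Docker config.json ->
# Kubernetes Secret of type kubernetes.io/dockerconfigjson.
import base64, json

def make_config_json(registry, username, password):
    """Build a Docker config.json with a base64 'user:pass' auth entry."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    return json.dumps({"auths": {registry: {"auth": auth}}})

def make_secret_yaml(name, config_json):
    """Wrap the config.json (base64-encoded) in a Kubernetes Secret manifest."""
    b64 = base64.b64encode(config_json.encode()).decode()
    return (
        "apiVersion: v1\n"
        "kind: Secret\n"
        f"metadata:\n  name: {name}\n"
        "type: kubernetes.io/dockerconfigjson\n"
        f"data:\n  .dockerconfigjson: {b64}\n"
    )

cfg = make_config_json("registry.example.com", "svc-account", "s3cr3t")
print(make_secret_yaml("my-pull-secret", cfg))
```

Going the other direction is the same steps in reverse: base64-decode `.dockerconfigjson`, edit the `auths` map, and re-encode.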

Spread the word!