You’re thinking about scale all wrong

Scale isn’t about large numbers

To hear modern architects, system designers, consultants and inexperienced (but forgivable) developers talk about scale, you’d think every product and service was built to be the next Twitter or Facebook.

Ironically, almost everything they create to be scalable would crash and burn if that actually happened. Even Google and Amazon aren’t an exception to this, at least from time to time. I know this because we run the largest build farm on the planet, and I’m exposed to dirty secrets about pretty much every cloud provider out there.

I want to talk about what scalability really means, why it matters and how to get there. Let’s briefly calibrate on how it’s used today.

Recap of pop-culture scalability

When most tech journalists and architects use the word scale, they use it as a noun. They imagine a very large static system that’s like… really really big in some way or another. Everyone throws out numbers like they’re talking about corn candy — hundreds or thousands of machines, millions of processes, billions of “hits” or transactions per second… you get the idea.

If you can quote a stupidly large number, you’re somehow considered important, impregnable even.

Netflix constitutes 37% of the US internet traffic at peak hours. Microsoft famously runs “a million” servers. Whatsapp moves a billion messages a day.

These numbers are impressive, no doubt. And it’s precisely because they’re impressive that we think of scale as a noun. “At a million servers,” “a billion transactions” or “20% of peak traffic” become defining characteristics of scale.

Why it’s all wrong

Calling something “scalable” simply because it is very, very, very large is like calling something realtime only because it is really, really fast.

Did you know that nowhere in the definition of “real-time systems” does it say “really, really fast?” Real-time systems are meant to be time-deterministic, i.e., they perform some operation in a predictable amount of time.

Having a system go uncontrollably fast can quite frequently be undesirable. You ever played one of those old DOS games on a modern PC? You know how they run insanely fast and are almost unplayable? That’s an example of a non-realtime system. Just because it runs incredibly fast doesn’t make it useful. That it could act with desirable and predictable time characteristics is what would make it a realtime system.

What makes a system realtime is that it works in time that is “real” — a game character’s movements must move in time that is like the real world, the soundtrack of a video must play to match the reality of the video, a rocket’s guidance computer must act in a time that matches the real world. Occasionally a “real time” system might have to execute NO-OPs so that certain actuators are signaled at the “correct time.”

As with much of computing, the definition of scalability depends on the correctness of a system, rather than the size or speed of it.

Scale is a verb, not a noun

The biggest misconception about scale is that it is about being “at scale.” There’s no honor, glory, difficulty or challenge in that, trust me. You want to see a 10K node cluster handling 100M hits per second? Pay me the bill, you got it. I’ll even spin it up over a weekend.

The real challenge, if you’ve ever run any service/product for more than a few months, is the verb “to scale.” To scale from 10 nodes to 100 nodes. To scale from 100 transactions to 500 transactions. To scale from 5 shards to 8 shards.

A scalable system isn’t one that launches some fancy large number and just stupidly sits there. A scalable system is one that scales as a verb, not runs at some arbitrary large number as a noun.

What scalability really means

We commonly use the Big-O notation to define the correctness of behavior in an algorithm. If I were to sort n numbers, a quicksort would perform at worst n-squared operations, and it would take n memory units. A realtime sort would add the additional constraint that it would respond within n minutes on the wall-clock.

Similarly, a scalable system has a predictable Big-O operational complexity to adapt to a certain scale.

Meaning, if you had to build a system to handle n transactions per second, how much complexity do you predict it would take to set it up?

O(n)? O(n-squared)? O(e^n)?

Not really an easy answer is it? Sure we try our best, and we question everything, and we often really worry about our choices at scale.

But are we scale-predictable? Are we scale-deterministic? Can we say that “for 10 million transactions a second, it would take the order of 10 million dollars, and NO MORE, because we are built to scale”?

I run into a dozen or so people who talk about large numbers and huge workloads. But very few people who can grow with my workload, with incremental operational costs.

Scalability doesn’t mean a LOT of servers. Anyone can rent a lot of servers and make them work. Scalability doesn’t mean a lot of transactions. Plenty of things will fetch you a lot of transactions.

Scalability is the Big-O measure of cost for getting to that number, and moreover, the predictability of that cost. The cost can be high, but it needs to be known and predictable.

Some popular things that “don’t scale”

Hopefully this explains why we say some things “don’t scale.” Let’s take the easiest punching bag — any SQL server. I can run a SQL server easy. One that handles a trillion transactions? Quite easy. With 20 shards? That’s easy too. With 4 hot-standby failovers? Not difficult. Geographically diverse failovers? Piece of cake.

However, the cost of going from the one SQL instance I run up to those things? The complexity cost is this jagged step function.

A lot of unpredictable jagged edges

And I’m only looking at a single dimension. Will the client need to be changed? I don’t know. Will that connection string need special attention? Perhaps.

You see, the difficulty/complexity isn’t in actually launching any of those scenarios. The challenge is in having a predictable cost of going from one scenario to a different scenario.

Why should this matter?

I’m advocating for predictable growth in complexity.

Let’s talk about my favorite example — rule-based security systems. Does any rule-based system (IPTables, firewalls, SELinux, AuthZ services) handle 10 million rules? You bet. If you have a static defined system that is architected on blueprints with every rule carefully predefined, it’s possible to create the rules and use them.

Can you smoothly go from 10 rules to 10,000 rules on a smooth slope? Paying complexity as you need it?

This is hardly ever the case. You might think that I’m advocating for a linear growth in complexity. I’m not. I’m advocating for a predictable growth in complexity. I’d be fine with an exponential curve, if I knew it was exponential.

What makes it unscalable, isn’t that the cost is VERY high, or that it is a predictable step function. What makes it truly unscalable is that the complexity is both abruptly and, worse, unpredictably step-py. You will add 10 rules sometimes. Add an 11th rule and it causes a conflict that leads to a 2-day investigation and debugging! You might add 100 nodes with ease. Add an extra node past some IP-range and you’ll be spending weeks with a network-tracer looking for the problem.

An example a bit closer to home. We’ve been looking for a home for Polyverse’s BigBang system — the world’s largest build farm that powers all the scrambling you get transparently and easily.

As an aside, you’ll notice that Polymorphic Linux is “scalable.” What cost/complexity does it take for n nodes? Whether that n be 1, 100, 10,000, 10,000,000? The answer is easily O(n). It is sub-linear in practice, but even in the worst case it is linear. There are no emergency consultants, system designers or architects required to rethink or redesign anything. This is an example of what good scalability looks like.

Behind the scenes of that scalability though, is another story. I’ve spoken to nearly every cloud provider on the planet. I may have missed a few here and there, but I bet if you named a vendor, I’ve spoken to them. They all have “scalable systems,” but what they really have are various systems built to different sizes.

Finding clouds/systems/clusters that can just run really, really large loads is easy. Running those loads is also easy. Finding clouds that are predictable in complexity based on a particular load? Even with all the cloud propaganda, that’s a tough one.

Cybersecurity needs more scalable systems, not systems “at scale”

Scalable systems are not about size, numbers or capability. They have a predictable cost in the dimension of size.

Hopefully I’ve explained what scalable really means. In much the same way that you’d measure a system in number of operations, amount of memory, number of transactions, or expected wall-clock time, a scalable system is operationally predictable in terms of size.

It doesn’t have to be cheap or linear. Merely predictable.

Cybersecurity today is desperately in need of solutions that “can scale,” not ones that merely run “at scale.” We need scalable solutions that encourage MORE security by adding MORE money. Not haphazard, arbitrary and surprising step functions.

Automatic Mitigation of Meltdown

Let’s look at what Meltdown is and how it works, as well as how it is stopped. A lot has been written about the Meltdown vulnerability, but it is still commonly misunderstood. A few diagrams may help.

First, let’s consider a simplified memory hierarchy for a computer: main memory, split into user memory and kernel memory; the cache (typically on the CPU chip); and then the CPU itself.

The bug is pretty simple. For about two decades now, processors have had a flag that tells them what privilege level a certain instruction is running in. If an instruction in user space tries to access memory in kernel space (where all the important stuff resides), the processor will throw an exception, and all will be well.

On certain processors though, the speculative executor fails to check this bit, thus causing side-effects in user space (caching of a page), which the user space instructions can test for. The attack is both clever and remarkably simple.

Let’s walk through it graphically. Assume your memory starts with this flushed cache state — nothing sits in the cache right now (the “flush” part of what is a a “flush-reload” attack):

Step 1: Find cached pages

First let’s allocate 256 pages on the user space that we can access. Assuming a page size of 4K, we just allocate 256 times 4K bytes of memory. It doesn’t matter where those pages reside in user-space memory, so long as we got the page size correct. In C-style pseudo-code:

char userspace[256 * 4096];

I’ll mark those in the userspace diagram — for brevity, I’ll only show a few pages, and I’m going to show cached pages popped up like this:

This allows for easier reading (and easier drawing for me!).

So let’s start with an empty (flushed) cache:

We know what the cache state would be if we accessed a byte in page 10. Since any byte in page 10 would do the trick, let’s just use the very first byte (at location 0).

The following code accesses that byte:

char dummy = userspace[10 * 4096];

This leads the state to be:

Now what if we measured the time to access each page and stored it?

int accessTimes[256];
for (int i=0; i < 256; i++) {
    t1 = now();
char dummy = userspace[i * 4096];
    t2 = now();
accessTimes[i] = t2-t1;

Since page 10 was cached, page 10’s access time would be significantly faster than all other pages which need a roundtrip to main memory. Our access times array would look something like this:

accessTimes = [100, 100, 100, 100, 100, 100, 100, 100, 100, 10, 100, 100....];

The 10th value (page 10) is an order of magnitude faster to access than anything else. So page 10 is cached, whereas others were not. Note though that all of the pages did get cached as part of this access loop. This is the “reload” part of the flush-reload side-channel — because we reloaded all pages into the cache.

At this point we can figure out which pages are cached with ease if we flush the cache, allow someone else to affect it, then reload it.

Step 2: Speculate on kernel memory

This step is easy. Let’s assume we have a pointer to kernel memory:

char *kernel = 0x1000; //or whatever the case is

If we tried to access it using an unprivileged instruction, it would fail — our user space instructions don’t have a privileged bit set:

char important = kernel[10];

Speculating this is easy. The instruction above would speculate just fine. It would then throw an exception, which would cause us to never get the value of important.

Step 3: Affect userspace based on speculated value

However, what happens if we speculated this?

char dummy = userspace[kernel[10] * 4096]

We know userspace has 256 * 4096 bytes — we allocated it. Since we’re only reading one byte from the kernel address, the maximum value is 255.

What happens when this line is speculated? Even though the processor detected the segmentation fault and prevented you from reading the value, did you notice that it cached the user-space page? The page whose number was the value of kernel memory!

Suppose the value ofkernel[10] was 17. Let’s run through this:

  1. Processor obtained kernel[10] using the branch predictor. That value was 17.
  2. The processor then dereferenced the 17th 4K-wide page in the array “userspace”: userspace[17 * 4096]
  3. The processor detected that you weren’t allowed to access kernel[10], and so told you you can’t execute the branch. Bad programmer!
  4. The processor left the cache untouched. It’s not going to let you touch kernel memory on the cache though. It’s got your back…

What was the state of cache at the end of this?

That’s cool! Using Step 1, we would get the 17th page time being the fastest — by a large amount from the others! That tells us the value of kernel[10] was 17, even though we never accessed kernel[10]!

Pretty neat huh? By going over the kernel byte by byte, we can get the value of every kernel address, by affecting cache pages.

What went wrong? How are we fixing it?

Meltdown is a genuine “bug” — it’s not in the side-channel. The bug is straightforward — CPU speculative execution should not cross security boundaries — and ultimately should be fixed in the CPU itself.

It’s not the cache that’s misbehaving — even though that’s where most operating-system vendors are fixing it. More precisely, they are attempting to further isolate kernel and userspace memory, using something called Kernel Page Table Isolation (KPTI), previously called KAISER. It maps very few “stub” pages to the process’s virtual memory, keeping the kernel out (and thus not reachable by the speculative execution engine).

Unfortunately, this segmentation is coming at a cost — accessing kernel memory now requires more expensive hardware-assisted transitions.

Polymorphic Linux stops ROP attacks; increases difficulty of others

Since Polymorphic Linux was intended for stopping ROP attacks dead in their tracks, all ROP attacks in kernel space are defeated by using polymorphic kernels. Especially when KASLR (kernel address space layout randomization) is defeated (which is so trivial that the Meltdown paper leaves it as an exercise for the reader).

Furthermore, since polymorphic binaries have different signatures, layouts, instructions and gadgets, they make it difficult by at least an order of magnitude to craft further attacks. Polymorphic binaries force the extra step of analysis and understanding per binary. This means that a lateral attack (one that moves from machine to machine in a network) becomes much harder.

Look out for my next post on Spectre. It’s a bit more difficult to explain and definitely harder than Meltdown to craft…

Fun with binaries!

ASLR and DEP defeated with three instructions and one offset!

This is Part 2 of my previous post that demonstrated how you craft undetectable attacks against binaries, using our colorful Open Source Entropy Visualization tool. I left you with a cliffhanger… so let’s begin there!

Recap of the cliffhanger

The cliffhanger I left you with was that all we need are three tiny ROP gadgets, and the offset of mprotect, to make any arbitrary part memory executable. First, I present my proof:

This is a video by Roy Sundahl, one of our most senior engineers, and our resident ROP expert who spends a lot of his time figuring out offensive tools.

Before we proceed, if you’re wondering why we can’t just block calls to mprotect, it turns out there’s some truth to Greenspun’s tenth rule. Let’s forgo the obvious candidates like interpreters and JITers. I learned that the tiniest of programs that might use regular expressions will need to call mprotect — including the innocuous “ls”.

Let’s cast a wider net!

Okay that exploit was cool, and you can do this for yourself by finding gadgets across all the libc’s in the samples.

But can we do more? Can we easily go after a range of machines *without* knowing a target signature? Let’s find out!

Here I’m comparing the same “version” of libc across CentOS 7.1 and 7.2. For a quick reference, on the right, rows with a red background are gadgets that survived perfectly, yellow background are gadgets that exist but at a different location, and no background are gadgets that didn’t exist in the first file.

We found some 2503 gadgets across them. You notice how little variation there is when the code was compiled at two different times, from what is probably two variations. The more gadgets that fall on the same addresses, the easier it is for us to cast a wide net since it requires that many fewer custom craftings to go after a binary. The way to determine if your exploit will work across both, first filter the right side by “Surviving Gadgets”, and then search for gadgets you want.

Let’s try that across CentOS 7.1 and 7.2. First up, pop rdi ; ret? Yep! There it is! The first common address is: c6169.

Second up, pop rsi ; ret? Yep! There it is also! First common address is: c7466.

Finally, pop rdx ; ret? Yep! The first surviving address is: 1b92.

We got our complete ROP chain across both binaries: c6169 c7466 1b92. We can validate this by simulating execution across both binaries.

Now you know the complete power of the tool!

This is what the tool is intended to do! You can verify rop chains across binaries without ever leaving your browser. You can now tell, visually and graphically, whether a particular attack will work against a given binary you run. It can be used to craft attacks, but it can also be used to ensure that a patch really worked.

There’s a bit of emotional comfort when you can execute a chain visually, see how the flow jumps around, and see that it doesn’t work.

Are Overflows/Leaks that common?

All this depends of course, on you being able to manipulate some little bit of stack space. Aren’t overflows so…. 2000s? We use bounds-checked modern languages that don’t suffer from these problems.

First of all, if you subscribe to our weekly breach reports, you’ll empirically find that overflows and memory leaks are pretty common. Even the internet’s favorite language, Javascript, is not immune.

Secondly, my best metric to find truth is to look for back-pressure (the sociological version of proof-by-contradiction). Look out for attempts at locking this down 100%, and then follow the backlash.

However, I also want you to get an intuitive understanding of where they arise and why they happen.

Even I have to admit that certain operations (such as sorting or XML/JSON parsing) are better implemented by manipulating memory buffers directly, despite my well-publicized extremist views favoring immutable data and list comprehensions,

So what does a “real overflow” look like? (Code in the samples directory.)

#include <stdio.h>
#define BUF_LEN 20
int main()
    char buf[BUF_LEN];
    int i=0;
    while (i++ < BUF_LEN) {
        printf("Setting buf[%d] to zero. n",i);
        buf[i] = 0;

I just overwrote a byte on the stack frame. It’s obvious when I point it out. If you were working on this code and not looking for overruns, this is easy to miss. Ever seen the college textbook example of a quicksort using while-loops to avoid using the system stack? They are liberal with while(1)s all over the place.

Personal Rant: They are very common, and they are insanely difficult to find. This is why I’m such an extremist about immutability, list comprehensions, symbolic computation. For your business apps, you should NEVER, unless under extreme exceptions, listen to that “clever” developer who is doing you the favor of writing efficient code. Pat them on the back. Give them a promotion or whatever. Get them out of the way. Then find a lazy person who’ll use list-comprehensions and copy-on-change wherever possible! I’m a big believer in Joe Armstrong’s advice here: First make it work. Then make it beautiful. Finally, if necessary, make it fast.

In our analyses, more than 65% of critical CVEs since June 1st fell under this category. I could be off by a few points on that number since it changes as we compile our reports periodically and tweak how we classify them. But it’s well over 60%.

Putting it all together

In Part 1, I showed you what ROP gadgets are, how to find them, chain them, and exploit them.

In Part 2, I completed the story by demonstrating how to find common gadgets across a wide array of deployed binaries.

The purpose of the Entropy Visualizer is to enable all this decomposition in your browser. In fact this is an easier tool than most ROP finders I know. 🙂

Happy Hunting!