A Bridge Over a River Never Crossed


When I first started my forever project, peer-to-peer file sync software using Interval Tree Clocks, I wanted to build it right.

That meant property-based testing everything, specifying the protocol fully, dealing with error conditions, and so on. Hell, I grabbed a copy of a TLA+ book to do it.

I started a document where I noted decisions and managed to write up a pretty nifty file-scanning library that could pick up and model file system changes over trees of files. The property tests are good enough to find out when things break due to Unicode oddities, and I felt pretty confident.

Then I got to writing the protocol for a peer-to-peer sync, with the matching state machines, and I got stuck. I couldn't come up with properties, and I had no idea what sort of spec I would even need. Only one sort of comparison kept popping in my mind: how do you invent the first telephone?

It’s already challenging to invent any telephone (or I would assume so at least), even with the benefit of having existing infrastructure and networks, on top of having other existing telephones to test it with. But for the first telephone ever, you couldn’t really start with a device that has both the mouthpiece and the ear piece in place, and then go “okay now to make the second one” and have a conversation once they're both done.

In some ways you have to imagine starting with two half-telephones, with a distinct half for each side. You start with a part to speak in and a speaker part that goes on the other side and send messages one way, and then sort of gradually build up a whole pair I guess?

An actor portraying Alexander Graham Bell speaking into an early model of the telephone for a 1926 promotional film by AT&T, public domain. The phone is a simple conical part in which the actor is speaking, attached to a piece of wood, with no ear piece at all.

This was the sort of situation I was finding myself in for the protocol: I wanted to build everything correctly the first time around, but I had no damn idea about how to wire up only one fine half to nothing just to figure out what shape exactly a whole exchange should have. I couldn't do it right all at once.

I had written protocols before, I had written production-grade distributed software before, there was prior art for this sort of thing, but I had never built this specific one.

This was like wanting to build a bridge, a solid one, to go over a river I had never crossed before. I could imagine the finished product’s general shape and purpose, I was eager to get to cross it. I had worked on some before, but not over this specific river. Hell, without having gone over the gap once end-to-end, I had no idea what the other side looked like.

I had also prototyped things before, and always wanted to make sure the prototype wouldn't end up in production too. As it turns out, forcing myself to prototype things and make a very bad version of the software was the most effective way to make a slightly less bad version of it that follows. And then a slightly better one, and another. This was iterative development winning over careful planning.

I’m at the point where I have a shoddy wooden bridge that I can cross over on. It’s real crappy software, it doesn’t deal with errors well (it’s safe and doesn’t break things, but it’s also unreliable and hangs more than I'd like), it’s not very fast, and it's downright unusable. But I now have a lot more infrastructure to work with. And once I’m through with the mess, I can maybe design a nicer form of it.

Building the bridge as you cross the river for the first time is a paralyzing thought, and despite all my wishes about it being done right on that initial attempt, it turns out it's a pretty good idea to make that first crossing as cheap and easy to tear down—and replace—as possible.

Saying "build a prototype and be ready to replace it" is a long known piece of conventional wisdom. The challenge is how crappy or solid should your prototype be? What is it that you're trying to figure out, and are you taking the adequate means for it?

There is a difference between a rough sketch with the right proportions and exploring from an entirely wrong perspective. Experience sort of lets you orient yourself early, and also lets you know which kind you have in your hands. I guess I'll find out soon if all the fucking around was done in the proper direction.

Funnily enough, traditional arch bridges were built by first having a wood framing on which to lay all the stones in a solid arch. That wood framing is called falsework, and is necessary until the arch is complete and can stand on its own. Only then is the falsework taken away. Without it, no such bridge would be left standing. That temporary structure, even if no trace is left of it at the end, is nevertheless critical to getting a functional bridge.

Falsework centering in the center arch of Monroe Street Bridge, Spokane, Washington, 1911. An elaborate wooden structure is supporting concrete until it can self-stand.

I always considered prototypes to be a necessary exploratory step, where you make mistakes, find key risks about your project, and de-risk a more final clean version. They were dirty drafts, meant to be thrown out until a clean plan could be drawn. I thought, if I had enough experience and knowledge, I could have that clean plan and just do it right.

Maybe I just needed to get over myself and consider my prototype to in fact be Falsework: essential, unavoidable, fundamentally necessary, even if only temporary.


Cheatsheets and 8 other ExDoc features that improve the developer experience

ExDoc has a cool new feature, cheatsheets!

In this blog post, we’ll explain what that new feature is and the motivation behind it. We’ll also take the opportunity to highlight other ExDoc features that show how it has been evolving to make the documentation experience in Elixir better and better.

What are ExDoc cheatsheets and how they improve the documentation experience

ExDoc’s cheatsheets are Markdown files with the .cheatmd extension. You can see an example of how the Ecto project is using them.
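As a rough sketch of what wiring a cheatsheet into a project could look like, the `mix.exs` below lists a `.cheatmd` file under ExDoc's `extras` option, next to a README. The project name, the version requirement, and the file paths are placeholders made up for this example; they are not taken from Ecto or any other real project.

```elixir
# mix.exs for a hypothetical project; every name and path here is a placeholder.
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :my_app,
      version: "0.1.0",
      deps: deps(),
      docs: docs()
    ]
  end

  defp deps do
    # Assumes an ExDoc release recent enough to understand .cheatmd files.
    [{:ex_doc, ">= 0.29.0", only: :dev, runtime: false}]
  end

  defp docs do
    [
      main: "readme",
      # Cheatsheets are listed like any other extra page.
      extras: ["README.md", "cheatsheets/getting_started.cheatmd"]
    ]
  end
end
```

With a configuration along these lines, running `mix docs` should render the cheatsheet alongside the guides and the API reference.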

Writing and reading cheatsheets is not exactly new to developers. What ExDoc brings to the table is the possibility of integrating cheatsheets alongside the rest of the documentation of an Elixir project, instead of hosting them in a different place.

Developers need different kinds of documentation at different times. When one is learning about a new library, a guide format is proper. When one needs to know if a library can solve a specific problem, an API reference can be more appropriate. When someone wants to remember a couple of functions they already used from that library, a cheatsheet could be more practical.

Imagine if you had to go to a different place for every type of documentation you’re looking for. That would make a very fragmented experience, not only for readers of documentation but also for writers.

ExDoc cheatsheets represent one step further in the direction of making documentation in Elixir an even more comprehensive and integrated experience.

ExDoc cheatsheets are inspired by devhints.io from Rico Sta. Cruz, and were contributed by Paulo Valim and Yordis Prieto.

Eight features that show how ExDoc has improved developer experience over time

We added cheatsheets to ExDoc because we value developer experience and believe documentation is a core aspect of it.

Since the beginning, one of Elixir’s principles is that documentation should be a first-class citizen. What this idea means to us is that documentation should be easy to write and easy to read. ExDoc has been continuously evolving over the years, guided by this principle.

Here are some of the features added to ExDoc over the years that make reading and writing documentation in Elixir a joy.

Beautiful and usable design

As developers, we may not have the skill to make beautifully designed UIs. That doesn’t mean we don’t appreciate it. Here’s what documentation generated with ExDoc looked like almost ten years ago, with its original layout based on YARD:

Screenshot of the Phoenix v0.5.0 documentation generated with an early version of ExDoc

Here’s what it looks like today:

Screenshot of the Phoenix v1.6.15 documentation generated with current ExDoc

The evolution of ExDoc’s design helped documentation be more visually appealing and easier to read and navigate.

Links to source code

Sometimes you’re reading the documentation of a library, and you want to know more about the implementation of a function. Or, you found something in the documentation that could be improved and want to help. In those situations, it’s helpful to go from the documentation to the source code. ExDoc makes that dead easy. For every module, function, or page, ExDoc gives you a link that you can click to go directly to the project’s source code on GitHub:

Short screencast of a user clicking on the "link to source code" button on the documentation for a function
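The links are derived from project configuration. Here is a minimal sketch, assuming a hypothetical project hosted at a placeholder GitHub URL: `source_url` tells ExDoc where the repository lives, and `source_ref` tells it which tag or branch the links should point to.

```elixir
# mix.exs for a hypothetical project; the URL and names are placeholders.
defmodule MyApp.MixProject do
  use Mix.Project

  @version "0.1.0"
  @source_url "https://github.com/example/my_app"

  def project do
    [
      app: :my_app,
      version: @version,
      # Repository that the "link to source code" buttons point to.
      source_url: @source_url,
      # Link to the tag matching the documented version rather than a moving branch.
      docs: [source_ref: "v#{@version}"],
      deps: [{:ex_doc, ">= 0.0.0", only: :dev, runtime: false}]
    ]
  end
end
```

Pointing `source_ref` at a version tag is a design choice: it keeps the links consistent with the code that was actually documented, even after the main branch moves on.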


Guides

One of the most common formats of library documentation is an API reference. But depending on your needs, that’s not the most approachable format. For example, it’s not optimal when you’re just getting started with a library or when you want to learn how to solve a specific problem using it. That’s why ExDoc allows writing other types of docs besides API references, like “Getting started” guides or How-tos.

Look at how Ecto’s documentation uses that, for example:

Screencast of a user exploring the guides in the Ecto documentation

Custom grouping of modules, functions, and pages in the sidebar

Sometimes your library has dozens of modules. Sometimes, one given module has a large API surface area. In those situations, showing the functions as a single large list may not be the most digestible presentation. For those cases, ExDoc allows modules, functions, or extra pages to be grouped in the sidebar in a way that makes more sense semantically.

Here’s an example of how Ecto uses grouped functions for its Repo module:

Screenshot of the sidebar of the Ecto documentation, showing grouped functions in the `Ecto.Repo` module

Instead of listing the ~40 functions of Ecto.Repo as a single extensive list, it presents them grouped by five cohesive topics:

  • Query API
  • Schema API
  • Transaction API
  • Runtime API
  • User callbacks

The same functionality is available for modules and pages (guides, how-tos, and so on). Phoenix is a good example of how that’s used.
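To give an idea of how grouping is configured for modules and pages, here is a sketch of a `docs` configuration using ExDoc's `groups_for_modules` and `groups_for_extras` options. The module names, guide paths, and group titles are all invented for the example. Function grouping works along similar lines but relies on `@doc` metadata plus a docs option whose name has changed across ExDoc releases, so it is left out here.

```elixir
# mix.exs for a hypothetical project; modules, paths, and group names are placeholders.
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :my_app,
      version: "0.1.0",
      deps: [{:ex_doc, ">= 0.0.0", only: :dev, runtime: false}],
      docs: docs()
    ]
  end

  defp docs do
    [
      extras: ["guides/introduction/installation.md", "guides/howtos/deployment.md"],
      # Sidebar groups for extra pages, matched by path.
      groups_for_extras: [
        Introduction: ~r{guides/introduction/},
        "How-tos": ~r{guides/howtos/}
      ],
      # Sidebar groups for modules, listed explicitly.
      groups_for_modules: [
        Adapters: [MyApp.Adapters.Postgres, MyApp.Adapters.SQLite],
        Testing: [MyApp.TestHelpers]
      ]
    ]
  end
end
```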

Full-text search

Sometimes you don’t know or don’t remember the name of the function that you’re looking for. For example, let’s say you’re looking for a function for dealing with file system directories.

Although there’s no function or module called “directory” in Elixir, when you type “directory” in Elixir’s documentation, it will return all the entries that have the word “directory” inside the documentation. It will even return entries with variations of the word “directory”, like “directories”, doing a fuzzy search.

Screenshot of the result of searching for "directory" in the Elixir documentation

The search bar also supports autocompletion for module and function names:

Screencast of a user typing the word "Enum" in the search bar of Elixir's documentation and letting it autocomplete the module. Then, the user types "Range" and both modules and functions show up.

The best part is that full-text search is fully implemented on the client-side, which means ExDoc pages can be fully hosted as static websites (for example on GitHub Pages).

Keyboard shortcuts to navigate to docs of other Hex packages

It’s common for an application to have dependencies. While coding, we usually need to read the documentation of more than one of those dependencies.

One solution is to keep a window open for the documentation of each dependency. However, ExDoc offers another option: a keyboard shortcut to search and go to another package documentation within the same window.

Here’s what it looks like:

Screencast of a user enabling the `g` shortcut to search through dependencies documentation and then using it to search for "phoenix_live" in the documentation for Nerves.

There are more keyboard shortcuts to help you navigate within and between documentation:

Screenshot of the keyboard shortcuts that you can enable in ExDoc

A version dropdown to switch to other versions

Keeping our application updated with the latest versions of all its dependencies can be challenging. So, it’s common to need to look at the documentation of an older version of a library we’re using. ExDoc makes it simple to do that.

When you access the documentation of a project, there’s a dropdown that you can use to select the version you’re looking for:

Screencast of a user clicking the version dropdown under the application name in the "timex" documentation, revealing all the versions.

Livebook integration

Livebook is a web application for writing interactive and collaborative code notebooks in Elixir.

One of the ways Elixir developers have been using Livebook is for documentation. Because of its interactivity capabilities, it enables the reader to play with the code right inside the documentation, which makes it great for tutorials and augmenting the user experience.

With that in mind, ExDoc offers the possibility of integrating Livebook notebooks. That means one can host Livebook-based documentation together with the API reference.

Here’s an example of using Livebook inside ExDoc for writing a Usage Guide:

Screencast of a user navigating through the "req_sandbox" documentation, finding a Livebook, clicking "Run in Livebook", and using the Livebook that opens up on their local machine.

Bonus: Erlang support

EEP 48 proposed a standardized way for how BEAM languages could store API documentation. This allows any BEAM language to read documentation generated by each other.

By leveraging that work, ExDoc can generate documentation for an Erlang project. For example, Telemetry is a library written in Erlang that has its documentation generated with ExDoc.

Screenshot of "telemetry" documentation generated with ExDoc

By using ExDoc to also generate documentation for Erlang-based projects, we can have more consistency in the user experience along the BEAM ecosystem. See the great rebar3_ex_doc plugin to get started.

Bonus: Doctests

When writing documentation, it’s helpful to offer code examples. For instance, here’s the documentation of the Enum.any?/1 function from Elixir’s standard library:

@doc """
Returns `true` if at least one element in `enumerable` is truthy.

When an element has a truthy value (neither `false` nor `nil`) iteration stops
immediately and `true` is returned. In all other cases `false` is returned.

## Examples

    iex> Enum.any?([false, false, false])
    false

    iex> Enum.any?([false, true, false])
    true

    iex> Enum.any?([])
    false

"""

To ensure examples do not get out of date, Elixir’s test framework ExUnit provides a feature called doctests. This allows developers to test the examples in their documentation. Doctests work by parsing out code samples starting with iex> from the documentation.

Although this is not a feature of ExDoc, it is an essential part of Elixir’s developer and documentation experience.
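To make the mechanism concrete, here is a small sketch of a documented function whose example doubles as a test. The `Temperature` module is made up for illustration. The test module is shown in a comment because it only runs inside a Mix project, where ExUnit is started and the module's docs are compiled to disk.

```elixir
# lib/temperature.ex (hypothetical module, invented for this example)
defmodule Temperature do
  @doc """
  Converts a temperature from Celsius to Fahrenheit.

  ## Examples

      iex> Temperature.to_fahrenheit(100)
      212.0

  """
  def to_fahrenheit(celsius), do: celsius * 9 / 5 + 32
end

# test/temperature_test.exs, inside a Mix project:
#
#     defmodule TemperatureTest do
#       use ExUnit.Case, async: true
#
#       # Extracts every `iex>` example from Temperature's docs and runs it as a test.
#       doctest Temperature
#     end
```

If the implementation or the documented result drifts, `mix test` fails on the example, which is what keeps the documentation honest.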

Wrap up

As we saw, ExDoc has evolved a lot throughout the years! As it continues to evolve into a more and more comprehensive documentation tool, we want to enable developers to keep investing more time writing the documentation itself instead of needing to spend time building custom documentation tools and websites. The best part is that all you need to do to leverage many of those features is to simply document your code using the @doc attribute!

Here’s to a continuously improving documentation experience for the next years.


The Law of Stretched [Cognitive] Systems


One of the things I knew right when I started at my current job is that a lot of my work would be for "nothing." I'm saying this because I work (as Staff SRE) for an observability vendor, and engineers tend to operate under the idea that the work they're doing is going to make someone's life easier, lower barriers of entry, or just make things simpler by making them understandable.

While this is a worthy objective that I think we are helping, I also hold the view that any such improvements would be used to expand the capacities of the system such that its burdens remain roughly the same.

I hold this view because of something called the Law of stretched systems:

Every system is stretched to operate at its capacity; as soon as there is some improvement, for example in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity.

Chances are you've noticed that the more RAM computers have, the more RAM browsers are going to take for tabs. The faster networks get, the larger the web pages that are served to you are going to be. If storage space is plentiful and cheap, movies and games and pictures are all going to get bigger and occupy that space.

If you've maintained APIs, you may have noticed that no matter what rate limit you put on an endpoint or feature, someone is going to ask for an order of magnitude more and find ways to use that capacity. You give people 10 alerts of budget for their top-line features, and they'll think 100 would be nice so you have one per microservice. You give them 100 and they start thinking maybe a 1,000 would be nice so each team can set 10, for the various features they maintain. You give them 1,000 and they start thinking 10,000 would be quite nice so each of their customers could get its own alert. Give more and maybe they can start reselling the feature themselves.

What is available will be used. Every system is stretched to operate at its capacity. Systems keep some slack capacity, but if they operate for long periods of time without this capacity being called upon, it likely gets optimized away.

Similar examples seem to also be present in larger systems—you can probably imagine a few around just-in-time supply chains given the last few years—but I'll avoid naming specifics as I'd be getting outside my own areas of expertise.

The law of stretched systems, I believe, applies equally well to most bottlenecks you can find in any socio-technical system. This would include your ability to keep up with what is going on with your system (be it social or technical) due to its complexity, intricacies, limited observability or understandability, or difficulties to enact change.

As far as I can tell, cognitive bandwidth and network bandwidth both display similar characteristics under that lens. That means that gaining extra capacity to understand what is going on, more clarity into the actions and functioning of the system is not likely to make your situation more comfortable in the long term; it's just going to expand how much you can accomplish while staying on that edge of understandability.

The pressures that brought the system's operating point where it is are likely to stay in place, and will keep influencing feedback loops it contains. Things are going to stay as fast-paced as they were, grow more hectic, but with better tools to handle the newly added chaos, and that's it. And that is why the work I do is for "nothing": things aren't going to get cozier for the workers.

Gains in productivity over the last decades haven't really reduced the working hours of the average worker, but they have transformed how the work is done. I have no reason to believe that gains in understandability (or on factors affecting productivity) would change that. We're just gonna get more software, moving faster, doing more things, always bordering on running out of breath.

And once the system "stabilizes", once the new tools or methods become a given and fade into the background as normal everyday things, the system will start optimizing some of its newly found slack away. Its ability to keep going will become dependent on these tools, and were they to stop working, major disruptions should be expected while it adapts and readjusts (or collapses).

This has, I suppose, a weird side-effect in that it's an ever-ongoing ladder-pulling move. The incumbent tool-chain has greater and greater network effects, and any game-changing approach that does not mesh well with the rest of it (but could sustain even greater capacity had it received a similar investment) will never be able to get a foothold into an established ecosystem. Any newcomer in the landscape has to either pay an ever-increasing cost just to get started, or has to have such a different and distinct approach that it cancels out the accrued set of optimizations others have.

These costs may be financial, technological, in terms of education and training, or cognitive. Maybe this incidental ladder-pulling is partially behind organizations always wanting—or needing—ever more experienced employees even for entry-level positions.

There's something a bit disheartening about assuming that any move you make that improves how understandable a system is will also rapidly be used to move as close to the edge of chaos as possible. I however have so far not observed anything that would lead me to believe things are going to be any different this time around.


Erlang/OTP 25.2 Release

OTP 25.2

Erlang/OTP 25.2 is the second maintenance patch package for OTP 25, with mostly bug fixes as well as improvements.

Below are some highlights of the release:

Potential incompatibilities:

  • The inet:setopts/2 {reuseaddr, true} option will now be ignored on Windows unless the socket is a UDP socket. For more information, see the description of the reuseaddr option in the documentation of inet:setopts/2. Prior to OTP 25 the {reuseaddr, true} option was ignored for all sockets on Windows, but as of OTP 25.0 this was changed so that it was not ignored for any sockets.

The Erlang/OTP source can also be found at GitHub on the official Erlang repository, https://github.com/erlang/otp

Download links for this and previous versions are found here


Get Rid of Your Old Database Migrations

Database migrations are great. I love to be able to change the shape of tables and move data around in a controlled way to avoid issues and downtime. However, lately I started to view migrations more like Git commits than like active pieces of code in my applications. In this post, I want to dig a bit deeper into the topic. I'll start with some context on database migrations, I'll expand on the Git commits analogy, and I'll show you what I've been doing instead.

I mostly write Elixir, so my examples will be in Elixir. If you're interested in the workflow I’m currently using for Elixir, jump to the last section of the post.

Cover image of a flock of birds flying, with a sunset sky in the background

What Are DB Migrations

DB migrations are pieces of code that you run to change the schema (and data, if desired) of your database. Wikipedia does a great job at explaining migrations, their history, and their use cases. I'll write a few words here for context, but go read that if you want to dig deeper.

Migrations have a few selling points over executing direct SQL commands.

First, database migrations usually keep information about the migrations themselves in a dedicated database table. To be precise, the framework that runs the migrations is the one storing information in that table. The framework uses this table to only run migrations that have not been run yet.

Another benefit is that you'll usually write migrations in a way that makes it possible to roll them back. "Rolling back" a migration means executing SQL statements that revert the changes done in that migration. Most database frameworks can infer the rollback steps from the migration definition if the migration is simple enough. This results in concise pieces of code to alter the database schema that can be executed "in both directions".

Let's see an example with Ecto, Elixir's database framework.

Example with Ecto

Let's say you have a users table with columns email and password. New requirements force you to add two new columns, first_name and last_name. To do that, you can write a migration that looks like this:

defmodule MyApp.AddNameToUsers do
  use Ecto.Migration

  def change do
    alter table("users") do
      add :first_name, :string
      add :last_name, :string
    end
  end
end
In this simple example, the framework is able to do what I mentioned above: you can define a single change/0 function with migration steps in it, and the framework is able to infer the corresponding rollback steps (in this case, removing the two new columns).
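To illustrate the idea of inferring rollback steps, here is a toy model in plain Elixir. It is not how Ecto is implemented: the `ToyMigration` module and the tuple shapes it uses are invented for this sketch. The point is only that simple steps have obvious inverses that can be applied in reverse order, while something like raw SQL does not.

```elixir
defmodule ToyMigration do
  # Hypothetical step shapes, invented for this sketch:
  #   {:add_column, table, column, type}
  #   {:remove_column, table, column, type}
  #   {:create_table, table}
  #   {:execute, sql}   (raw SQL, which cannot be inverted automatically)

  @doc """
  Returns `{:ok, rollback_steps}`, or `{:error, step}` for the first step
  (in rollback order) that has no known inverse.
  """
  def rollback(steps) do
    steps
    |> Enum.reverse()
    |> Enum.reduce_while({:ok, []}, fn step, {:ok, acc} ->
      case invert(step) do
        {:ok, inverse} -> {:cont, {:ok, [inverse | acc]}}
        :error -> {:halt, {:error, step}}
      end
    end)
    |> case do
      # The accumulator was built by prepending, so flip it back into rollback order.
      {:ok, inverted} -> {:ok, Enum.reverse(inverted)}
      error -> error
    end
  end

  defp invert({:add_column, table, column, type}), do: {:ok, {:remove_column, table, column, type}}
  defp invert({:remove_column, table, column, type}), do: {:ok, {:add_column, table, column, type}}
  defp invert({:create_table, table}), do: {:ok, {:drop_table, table}}
  defp invert(_other), do: :error
end
```

For the migration above, the model would turn "add first_name, add last_name" into "remove last_name, remove first_name". A step like `{:execute, "..."}` yields an error, which mirrors why less trivial migrations need their rollback written by hand.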

When you run the migration (with mix ecto.migrate in this case), Ecto adds a row to the schema_migrations table:

version        | inserted_at
20221114232841 | 2022-11-15 21:27:50

Why Do We Keep Migrations Around?

Until recently, I had never worked on an application that did not keep all the migrations around. I'd always seen the priv/repo/migrations directory in Elixir applications full of files. I want to get one disclaimer out of the way: the many-migrations experience is a personal one, and I might be late to the party here. But hey, here's to hoping someone else is too and that this write-up is gonna help them out.

At one point, I started working on an older unfamiliar codebase. The experience made me think of two things.

The first one is reading the complete, up-to-date database schema structure. I'd constantly fire up a Postgres GUI (I use TablePlus) to look at the structure of the database, since it was hard to navigate old migrations and piece together what their end result is.

The second one revolves around searching through code. Working on the new codebase involved a lot of searching all around the code to understand the structure of the application. Function names, modules, database columns, and what have you. However, database columns stuck with me: I'd always find a bunch of misleading search results in old migrations. For example, I'd see a few results for a column name that was created, then modified, and then dropped.

So I started wondering: why do we keep old migrations around? Don't get me wrong, I know why we write migrations in the first place. They're great, no doubts. But why not throw them away after they've done their job? How many times did you roll back more than one migration? I have never done that. It's hard to imagine rolling back many changes, especially when they involve not only the schema but also the data in the database itself. There must be a better way.

Analogy with Git Commits

I started to think of database migrations more like Git commits. You apply commits to get to the current state of your application. You apply database migrations to get to the current schema of your database. But after some time, Git commits become a tool for keeping track of history more than an active tool for moving back and forth between versions of the code. I'm now leaning towards treating database migrations the same way. I want them to stick around for a bit, and then "archive" them away. They're always going to be in the Git history, so I’m never really losing the source file, only the ability to apply the migrations.

So, how do we deal with this in practice?

Dumping and Loading

It turns out that this is something others have already thought about.

Database frameworks that provide migration functionality usually provide ways to dump and load a database schema. If they don't, fear not: major databases provide that themselves. In fact, in Elixir, Ecto's dump and load tasks only really act as proxies on top of tools provided by the underlying databases (such as pg_dump and psql for PostgreSQL).

The idea is always the same: to get the current state of the database, you'll run the dumping task. With Ecto and other frameworks, this produces an SQL file of instructions that you can feed to your database when you want to load the schema again.

Some frameworks provide a way to squash migrations instead. Django, for example, has the squashmigrations command. However, the concept is almost the same. Ruby on Rails's ActiveRecord framework has a unique approach: it can generate a Ruby schema file from migrations. It can also generate the SQL schema file mentioned above via the database, but the Ruby approach is interesting. Its power is limited, however, since the Ruby schema file might not be able to reconstruct the exact schema of the database. From the documentation:

While migrations may use execute to create database constructs that are not supported by the Ruby migration DSL, these constructs may not be able to be reconstituted by the schema dumper.

Dumping and loading the database schema works well in local development and testing, but not in production though, right? You don't want to load a big old SQL file in a running production database. I think. Well, you don't really have to. Production databases tend to be reliable and (hopefully) backed up, so "restoring" a schema is not something you really do in production. It'd be analogous to re-running all the migrations: you just never do it.

Advantages and Disadvantages of Ditching Old Migrations

I find that dumping old migrations and loading an up-to-date SQL file has a few advantages.

  1. You get a complete view of the schema: The SQL file with the database schema now represents a complete look at the structure of the database. You can see all the tables, indexes, default values, constraints, and so on. Sometimes, you'll still need to create migrations and run them, but they're going to live in your codebase only temporarily, and it's only going to be a handful of them at a time, instead of tens (or hundreds) of files.

  2. Speed: A minor but not irrelevant advantage of this approach is that it speeds up resetting the database for local development and tests. Applying migrations can do many unnecessary operations in the database, such as creating tables only to delete them just after. When loading the database dump, you're really doing the smallest possible set of commands to get to the desired state.

However, it's not all great here. There are some disadvantages as well:

  1. Digging through Git: There are going to be situations in which you look at the migration table in your database and want to figure out the migration that corresponds to a given row. This approach makes this use case slightly more annoying, because you'll have to dig through your Git history to find the original migration. Not a big deal in my opinion, I don't really do this that much.

  2. Deploying without running migrations: Make sure to deploy and run migrations. With this approach, that's not something to take for granted. You might get a bit too comfortable dumping the current database schema and deleting migration files. You might end up in situations where you create a migration, run it locally, and then dump the schema and delete the migration, all without deploying. This would result in the migration not running in production.

Workflow in Elixir

Now for a small section specific to Elixir. Ecto provides the dump and load tasks mentioned above, mix ecto.dump and mix ecto.load respectively.

In my applications, I've been doing something like this:

  1. I updated the common Mix aliases for dealing with migrations to take dumping/loading into account. Those aliases look something like this now:

    defp aliases do
      [
        "ecto.setup": ["ecto.create", "ecto.load", "ecto.migrate"],
        "ecto.reset": ["ecto.drop", "ecto.setup"],
        test: [
          "ecto.create --quiet",
          "ecto.load --quiet --skip-if-loaded",
          "ecto.migrate --quiet",
          "test"
        ]
      ]
    end

    As you can see, the aliases now always run mix ecto.load before calling mix ecto.migrate. The --skip-if-loaded flag in the test alias ensures that the command is idempotent, that is, can be run multiple times without changing the result.

  2. I added a Mix alias to "dump migrations", that is, dump the current database structure and delete all the current migration files. It looks like this:

    defp aliases do
      [dump_migrations: ["ecto.dump", &delete_migration_files/1]]
    end

    defp delete_migration_files(_args) do
      # Match all files in the 21st century (year is 20xx).
      Enum.each(Path.wildcard("priv/repo/migrations/20*.exs"), fn migration_file ->
        File.rm!(migration_file)
        Mix.shell().info([:bright, "Deleted: ", :reset, :red, migration_file])
      end)
    end

    The path wildcard could be improved, or you could have logic that reads the files and checks that they're migrations. However, this does a good-enough job.
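
    One way the wildcard could be tightened is sketched below. It assumes migration files follow the default mix ecto.gen.migration naming (a 14-digit timestamp, an underscore, a name, and the .exs extension). The MigrationFiles module and its function names are hypothetical and not part of Ecto or Mix.

```elixir
defmodule MigrationFiles do
  # Hypothetical helper, not part of Ecto or Mix.
  # Assumes the default naming scheme: <14-digit timestamp>_<name>.exs

  # Returns true when the file's basename looks like a generated migration.
  def migration_file?(path) do
    Regex.match?(~r/^\d{14}_\w+\.exs$/, Path.basename(path))
  end

  # Keeps only the paths that look like migrations.
  def filter(paths), do: Enum.filter(paths, &migration_file?/1)
end
```

    The result of Path.wildcard/1 could be passed through MigrationFiles.filter/1 before deleting anything, so that a stray file such as 20_notes.exs is left alone.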


If you start to look at database migrations as analogous to Git commits, you can start to treat them that way. We saw how to use the "dumping" and "loading" functionality provided by many databases and database frameworks. We saw the advantages and disadvantages of this approach. Finally, I showed you the approach I use in Elixir.


Hiding Theory in Practice


I'm a self-labeled incident nerd. I very much enjoy reading books and papers about them, I hang out with other incident nerds, and I always look for ways to connect the theory I learn about with the events I see at work and in everyday life. As it happens, studying incidents tends to put you in close proximity with many systems that are in various states of failure, which also tends to elicit all sorts of negative reactions from the people around them.

This sensitive nature makes it perhaps unsurprising that incident investigation and review facilitation come with a large number of concepts and practices you are told to avoid because they are considered counterproductive. A tricky question I want to discuss in this post is how to deal with them when you see them come up.

A small sample of these undesirable concepts includes things such as:

  • Root Cause: I've covered this one in Errors are constructed, not discovered. To put it briefly, focusing on root causes tends to narrow the investigation in a way that ignores a rich tapestry of contributing factors.
  • Normative Judgments: this is often used when saying someone should have done something that they have not. It carries the risk of siding with the existing procedure as correct and applicable by default, and tends to blame and demand change from operators more than their tools and support structure.
  • Counterfactuals: those are about things that did not happen: "had we been warned earlier, none of this would have cascaded." This is a bit like preparing for yesterday's battle. It's very often coupled with normative judgments ("the operator failed to do X, which led to ...")
  • Human Error: generally not a useful concept, at least not in the way you'd think. This is best covered in "Those found responsible have been sacked" by Richard Cook or The Field Guide to Understanding 'Human Error', but tends to be the sign of an organization protecting itself, or of a failed investigation. Generally the advice is that if you find human error, that's where the investigation begins, not where it ends.
  • Blame: psychological safety is generally hard to maintain if people feel that they are going to be punished for doing their best and trying to help. You can only get good information if people trust that they can reveal it. Blameless processes, or rather blame-aware reviews, aim to foster this safety.

There are more concepts than these, and each could be a post on its own. I've chosen this list because each of them is an absolutely common reaction, something so intuitive it will feel self-evident to people using them. Avoiding these requires a kind of unlearning, so that you can remove the usual framing you'd use to interpret events, and then gradually learn to reconstruct them differently.

This is challenging, and while this is something you and other self-labeled incident nerds can extensively discuss and debate as peers, it is not something you can reasonably expect others to go through in a natural post-incident setting. Most of the people with whom you will interact will never care about the theory as much as you do, almost by definition since you're likely to represent expertise for the whole organization on these topics.

In short, you need to find how to act in a way that is coherent with the theory you hold as an inspiration while being flexible enough to not cause friction with others, nor requiring them to know everything you know for your own work to be effective.

As an investigator or facilitator, let's imagine someone who's a technical expert on the team comes to you during the investigation (before the review) and says "I don't get why the on-call engineer couldn't find the root cause right away since it was so obvious. All they had to do was follow the runbook and everything would have been fine!"

There are going to be times where it's okay to let go of these comments, to avoid doing a deep dive on every opportunity. In the context of a review based on a thematic analysis, the themes you are focusing on should help direct where you put your energy, and guide you to figure out whether emotionally-charged comments are relevant or not.

But let's assume they are relevant to your themes, or that you're still trying to figure them out. Here are two reactions you can have, which may come up as easy solutions but are not very constructive:

  • You may want to police their intervention: since you care for blame-awareness and psychological safety, you may want to nip this behavior in the bud and let them know about the issues around blame, normativeness and counterfactuals.
  • You may also want to ignore that statement, drop it from your notes, and make sure it does not come up in any written form. Just pretend it never came up.

In either case, if behavior that clashes with theoretical ideals is not welcomed, the end result is that you lose precious data, either by omission or by making participants feel less comfortable in talking to you.

Strong emotional reactions are as good data as any architecture diagram for your work. They can highlight important and significant dynamics about your organization. Ignoring them is ignoring potentially useful data, and may damage the trust people put in you.

The approach I find more useful is one where the theoretical points you know and appreciate guide your actions. That statement is full of amazing hooks to grab onto:

  • That they believe it is obvious but was not to the on-call engineer hints at a clash in their mental models, which is a great opportunity to compare and contrast them. Diverging perspectives like that are worth digging into because they can reveal a lot.
  • The thought that the runbook is complete and adequate is worth exploring: was the on-call engineer aware of it? Are runbooks considered trustworthy by all? Were they entertaining hypotheses or observing signals that pointed another direction? Is there any missing context?
  • That counterfactual point ("everything would have been fine!") is a good call-out for perspective. Does it mean next time we need to change nothing? Can we look into challenges around the current situation to help shape decision-making in the future?
  • Is this frustrated reaction pointing at patterns the engineer finds annoying? Does it hint at conflicts or lack of trust across teams?
  • Zooming out from the "root cause" with a newcomer's eyes can be a great way to get insights into a broader context: is this failure mechanism always easily identifiable? Are there false positives to care for? Has it changed recently? What's the broader context around this component? You can discuss "contributing factors" even when using the words "root cause" with people.

None of these requires interrupting or policing what the interviewee is telling you. The incident investigation itself becomes a place where various viewpoints are shared. The review should then be a place where everyone can broaden their understanding, and can form their own insights about how the socio-technical system works. Welcome the data, use it as a foothold for more discoveries.

If you do bring that testimony to the review (on top of having used it to inform the investigation), make sure you frame it in a way that feels safe and unsurprising for all participants involved. Respect the trust they've put in you.

How to do this, it turns out, is not something about which I have seen a lot of easily applicable theory. It's just hard. If I had to guess, I'd say there's a huge part of it that is tacit knowledge, which means you probably shouldn't wait on theory to learn how to do it. It's way too contextual and specific to your situation. If this is indeed the case, theory can be a fuzzy guideline for you at most, not a clear instruction set.

This is how I believe theory is most applicable: as a hidden guide you use to choose which paths to take, which actions to prefer. There's a huge gap between the idealized higher level models and the mess (or richness) of the real world situations you'll be in. Navigating that gap is a skill you'll develop over time. Theory does not need to be complete to provide practical insights for problem resolution. It is more useful as a personal north star than as a map. Others don't need to see it, and you can succeed without it.

Thanks to Clint Byrum for reviewing this text.


Miscellany or Covert Procrastination?

I was thinking about writing a new blog post, but I suddenly remembered that I had to create a Markdown file from scratch for it, and recalled that I had that feature half-finished in Lambdapad. When I opened the editor, I remembered that I hadn't updated my Elixir version... has this ever happened to you?


My Future with Elixir: set-theoretic types

This is a three-article series on My Future with Elixir, containing excerpts from my keynotes at ElixirConf Europe 2022 and ElixirConf US 2022.

In May 2022, we celebrated 10 years since Elixir v0.5, the first public release of Elixir, was announced.

At such occasions, it may be tempting to try to predict how Elixir will look 10 years from now. However, I believe that would be a futile effort because, 10 years ago, I would never have guessed that Elixir would go beyond excelling at web development and into domains such as embedded software, or make inroads into machine learning and data analysis with projects such as Nx (Numerical Elixir), Explorer, Axon and Livebook. Elixir was designed to be extensible and how it will be extended has always been a community effort.

For these reasons, I choose to focus on My Future with Elixir. Those are the projects I am personally excited about and working on alongside other community members. The topic of today’s article is type systems, as discussed in my ElixirConf EU presentation in May 2022.

The elephant in the room: types

Throughout the years, the Elixir Core Team has addressed the biggest needs of the community. Elixir v1.6 introduced the Elixir code formatter, as the growing community and large teams saw an increased need for style guides and conventions around large codebases.

Elixir v1.9 shipped with built-in support for releases: self-contained archives that consist of your application code, all of its dependencies, plus the whole Erlang Virtual Machine (VM) and runtime. The goal was to address the perceived difficulty in deploying Elixir projects, by bringing tried approaches from both Elixir and Erlang communities into the official tooling. This paved the way to future automation, such as mix phx.gen.release, which automatically generates a Dockerfile tailored to your Phoenix applications.

Given our relationship with the community, it would be disingenuous to talk about my future with Elixir without addressing what seems to be the biggest community need nowadays: static typing. However, when the community asks for static typing, what are we effectively expecting? And what is the Elixir community to gain from it?

Types and Elixir

Different programming languages and platforms extract different values from types. These values may or may not apply to Elixir.

For example, different languages can extract performance benefits from types. However, Elixir still runs on the Erlang VM, which is dynamically typed, so we should not expect any meaningful performance gain from typing Elixir code.

Another benefit of types is to aid documentation (emphasis on the word aid as I don’t believe types replace textual documentation). Elixir already reaps similar benefits from typespecs and I would expect an integrated type system to be even more valuable in this area.

However, the upsides and downsides of static typing become fuzzier and prone to exaggerations once we discuss them in the context of code maintenance, in particular when comparing types with other software verification techniques, such as tests. In those situations, it is common to hear unrealistic claims such as “a static type system would catch 80% of my Elixir bugs” or that “you need to write fewer tests once you have static types”.

While I explore why I don’t believe those claims are true during the keynote, saying a static type system helps catch bugs is not helpful unless we discuss exactly the type of bugs it is supposed to identify, and that’s what we should focus on.

For example, Rust’s type system helps prevent bugs such as deallocating memory twice, dangling pointers, data races in threads, and more. But adding such a type system to Elixir would be unproductive because those are not bugs that we run into in the first place, as those properties are guaranteed by the garbage collector and the Erlang runtime.

This brings another discussion point: a type system naturally restricts the amount of code we can write because, in order to prove certain properties about our code, certain styles have to be rejected. However, I would prefer to avoid restricting the expressive power of Elixir, because I am honestly quite happy with the language semantics (which we mostly inherited from Erlang).

For Elixir, the benefit of a type system would revolve mostly around contracts. If function caller(arg) calls a function named callee(arg), we want to guarantee that, as both these functions change over time, caller keeps passing valid arguments into callee and properly handles the return types from callee.

This may seem like a simple guarantee to provide, but we’d run into tricky scenarios even on small code samples. For example, imagine that we define a negate function, that negates numbers. One may implement it like this:

def negate(x) when is_integer(x), do: -x

We could then say negate has the type integer() -> integer().

With our custom negation in hand, we can implement a custom subtraction:

def subtract(a, b) when is_integer(a) and is_integer(b) do
  a + negate(b)
end

This would all work and typecheck as expected, as we are only working with integers. However, imagine in the future someone decides to make negate polymorphic, so it also negates booleans:

def negate(x) when is_integer(x), do: -x
def negate(x) when is_boolean(x), do: not x

If we were to naively say that negate now has the type integer() | boolean() -> integer() | boolean(), we would now get a false positive warning in our implementation of subtract:

Type warning:

  |  def subtract(a, b) when is_integer(a) and is_integer(b) do
  |    a + negate(b)
         ^ the operator + expects integer(), integer() as arguments,
           but the second argument can be integer() | boolean()

So we want a type system that can type contracts between functions but, at the same time, avoids false positives and does not restrict the Elixir language. Balancing those trade-offs is not only a technical challenge but also one that needs to consider the needs of the community. The Dialyzer project, implemented in Erlang and available for Elixir projects, chose to have no false positives. However, that implies certain bugs may not be caught.

At this point in time, it seems the overall community would prefer a system that flags more potential bugs, even if it means more false positives. This may be particularly tricky in the context of Elixir and Erlang because I like to describe them as assertive languages: we write code that will crash in the face of unexpected scenarios because we rely on supervisors to restart parts of our application whenever that happens. This is the foundation of building self-healing and fault-tolerant systems in those languages.

On the other hand, this is what makes a type system for Erlang/Elixir so exciting and unique: the ability to deal with failure modes both at compile-time and runtime elegantly. Because at the end of the day, regardless of the type system of your choice, you will run into unexpected scenarios, especially when interacting with external resources such as the filesystem, APIs, distributed nodes, etc.

The big announcement

This brings me to the big announcement from ElixirConf EU 2022: we have an on-going PhD scholarship to research and develop a type system for Elixir based on set-theoretic types. Guillaume Duboc (PhD student) is the recipient of the scholarship, led by Giuseppe Castagna (Senior Researcher) with support from José Valim (that’s me).

The scholarship is a partnership between the CNRS and Remote. It is sponsored by Supabase (they are hiring!), Fresha (they are hiring!), and Dashbit, all heavily invested in Elixir’s future.

Why set-theoretic types?

We want a type system that can elegantly model all of Elixir idioms and, at a first glance, set-theoretic types were an excellent match. In set-theoretic types, we use set operations to define types and ensure that the types satisfy the associativity and distributivity properties of the corresponding set-theoretic operations.

For example, numbers in Elixir can be integers or floats, therefore we can write them as the union integer() | float() (which is equivalent to float() | integer()).

Remember the negate function we wrote above?

def negate(x) when is_integer(x), do: -x
def negate(x) when is_boolean(x), do: not x

We could think of it as a function that has both types (integer() -> integer()) and (boolean() -> boolean()), which is an intersection. This would naturally solve the problem described in the previous section: when called with an integer, it can only return an integer.
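
As a rough analogy in notation that exists today, typespecs already accept several @spec declarations for one function, which reads much like the intersection described above. This is only a sketch: the syntax of the set-theoretic type system has not been decided, and the Negation module name is made up for the example.

```elixir
defmodule Negation do
  # Two specs for one function. Read together they say
  # (integer() -> integer()) and (boolean() -> boolean()).
  @spec negate(integer()) :: integer()
  @spec negate(boolean()) :: boolean()
  def negate(x) when is_integer(x), do: -x
  def negate(x) when is_boolean(x), do: not x

  # Under the intersection reading, negate(b) is an integer here
  # because b is an integer, so the addition is well typed.
  @spec subtract(integer(), integer()) :: integer()
  def subtract(a, b) when is_integer(a) and is_integer(b) do
    a + negate(b)
  end
end
```

Both clauses of negate keep working at runtime, and subtract never sees a boolean as long as it is called with integers.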

We also have a data structure called atoms in Elixir. Each atom uniquely represents a value given by its own name, such as :sunday or :banana. You can think of the type atom() as the set of all atoms. In addition, we can think of the values :sunday and :banana as subtypes of atom(), as they are contained in the set of all atoms. :sunday and :banana are also known as singleton types (as they are made up of only one value).

In fact, we could even consider each integer to be a singleton type that belongs to the integer() set. The choice of which values will become singletons in our type system will strongly depend on the trade-offs we defined in the previous sections.
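
Typespecs can already express this idea for atoms, which may help make it concrete. The sketch below uses today's typespec notation, not the future type system, and the Weekend module is a made-up example.

```elixir
defmodule Weekend do
  # Each atom literal is a singleton type; their union is a subtype of atom().
  @type day :: :saturday | :sunday

  # Accepts any atom and answers whether it belongs to the smaller set.
  @spec weekend?(atom()) :: boolean()
  def weekend?(day) when day in [:saturday, :sunday], do: true
  def weekend?(day) when is_atom(day), do: false
end
```

Here :sunday belongs to both day and atom(), while :banana belongs only to atom().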

Furthermore, the type system has to be gradual, as any typed Elixir code would have to interact with untyped Elixir code.

Personally, I find set-theoretic types an elegant and accessible approach to reason about types. At the end of the day, an Elixir developer won’t have to think about intersections when writing a function with multiple clauses, but the modelling is straightforward if they are ever to look under the hood.

Despite the initial fit between Elixir semantics and set-theoretic types, there are open questions and existing challenges in putting the two together. Here are some examples:

  • Elixir has an expressive collection of idioms used in pattern matching and guards. Can we map them all to set-theoretic types?

  • Elixir associative data structures, called maps, can be used both as records and as dictionaries. Would it be possible to also type them with a unified foundation?

  • Gradual type systems must introduce runtime type checks in order to remain sound. However, those type checks will happen in addition to the checks already done by the Erlang VM, which can degrade performance. Therefore, is it possible to leverage the existing runtime checks done by the Erlang VM so the resulting type system is still sound?

Those challenges are precisely what makes me excited to work with Giuseppe Castagna and Guillaume Duboc, as we believe it is important to formalize those problems and their solutions, before we dig deep into the implementation. To get started with set-theoretic types, I recommend Programming with union, intersection, and negation types by Giuseppe Castagna.

Finally, it is important to note there are areas we don’t plan to tackle at the moment, such as typing of messages between processes.

Expectations and roadmap

At this point, you may be expecting that Elixir will certainly become a gradually typed language at some moment in its future. However, it is important to note this may not be the case, as there is a long road ahead of us.

One of the challenges in implementing a type system - at least for someone who doesn’t have the relevant academic background like myself - is that it feels like a single indivisible step: you take a language without a type system and at the end you have one, without much insight or opportunity for feedback in the middle. Therefore, we have been planning to incorporate the type system into Elixir in steps, which I have been referring to as “a gradual gradual type system”: one where we add gradual types to the language gradually.

The first step, the one we are currently working on, is to leverage the existing type information found in Elixir programs. As previously mentioned, we write assertive code in Elixir, which means there is a lot of type information in patterns and guards. We want to lift this information and use it to type check existing codebases. The Erlang compiler already does so to improve performance within a single module and we want to eventually do so across modules and applications too.
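
To illustrate how much type information ordinary assertive code already carries, here is a small sketch. The Shapes module is hypothetical, and the comments describe what a reader (or a checker) could infer, not what any current tool is confirmed to report.

```elixir
defmodule Shapes do
  # The pattern says the argument is a map with :width and :height keys.
  # The guard says both values are numbers. No annotations are needed
  # to learn that much about the expected argument.
  def area(%{width: w, height: h}) when is_number(w) and is_number(h) do
    w * h
  end
end
```

Calling Shapes.area/1 with anything else fails with a FunctionClauseError at runtime, which is exactly the kind of mismatch such lifted information could flag earlier.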

During this phase, Elixir developers won’t have to change a single line of code to leverage the benefits of the type system. Of course, we will catch only part of existing bugs, but this will allow us to stress test, benchmark, and collect feedback from developers, making improvements behind the scenes (or even revert the whole thing if we believe it won’t lead us where we expect).

The next step is to introduce typed structs into the language, allowing struct types to propagate throughout the system, as you pattern match on structs throughout the codebase. In this stage we will introduce a new API for defining structs, yet to be discussed, and developers will have to use the new API to reap its benefits.

Then finally, once we are happy with the improvements and the feedback collected, we can move on to introducing a new syntax for typing function signatures in Elixir codebases, including support for more advanced features such as polymorphic types. Those will allow us to type complex constructs such as the ones found in the Enum module.

The important point to keep in mind is that those features will be explored and developed in steps, with plenty of opportunity to gather community feedback. I also hope our experience may be useful to other ecosystems that wish to gradually introduce type systems into existing programming languages, in a way that feels granular and participative.

Thank you for reading and see you in a future article of the “My Future with Elixir” series.


Erlang/OTP 25.1 Release

OTP 25.1

Erlang/OTP 25.1 is the first maintenance patch package for OTP 25, with mostly bug fixes as well as quite a few small improvements.

Below are some highlights of the release:


  • Crypto is now considered to be usable with the OpenSSL 3.0 cryptolib for production code. ENGINE and FIPS are not yet fully functional.

  • Changed the behaviour of the engine load/unload functions


  • A vulnerability has been discovered and corrected. It is registered as CVE-2022-37026 “Client Authentication Bypass”. Corrections have been released on the supported tracks as patches, including 25.0.2. The vulnerability might also exist in older OTP versions. We recommend that impacted users upgrade to one of these versions or later on the respective tracks. OTP 25.1 would be an even better choice. Impacted are those who are running an ssl/tls/dtls server using the ssl application either directly or indirectly via other applications, for example via inets (httpd), cowboy, etc. Note that the vulnerability only affects servers that request client certification, that is, servers that set the option {verify, verify_peer}.
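
A quick way to see whether a given listener configuration falls into that category is to look for the option in its ssl options. The sketch below is only an illustration: the TlsOptions module is hypothetical, and whether a system is actually affected also depends on the OTP version it runs.

```elixir
defmodule TlsOptions do
  # Hypothetical helper: true when the ssl server options ask clients
  # for a certificate, i.e. when they contain {verify, verify_peer}.
  def requests_client_cert?(opts) when is_list(opts) do
    {:verify, :verify_peer} in opts
  end
end
```

For instance, options containing verify: :verify_peer return true, while verify: :verify_none or an empty list return false.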

The Erlang/OTP source can also be found at GitHub on the official Erlang repository, https://github.com/erlang/otp

Download links for this and previous versions are found here


Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.