My Future with Elixir: set-theoretic types

This is a three-article series on My Future with Elixir, containing excerpts from my keynotes at ElixirConf Europe 2022 and ElixirConf US 2022.

In May 2022, we celebrated 10 years since Elixir v0.5, the first public release of Elixir, was announced.

On such occasions, it may be tempting to try to predict how Elixir will look 10 years from now. However, I believe that would be a futile effort because, 10 years ago, I would never have guessed that Elixir would go beyond excelling at web development into domains such as embedded software, while also making inroads into machine learning and data analysis with projects such as Nx (Numerical Elixir), Explorer, Axon, and Livebook. Elixir was designed to be extensible and how it will be extended has always been a community effort.

For these reasons, I choose to focus on My Future with Elixir: the projects I am personally excited about and working on alongside other community members. The topic of today’s article is type systems, as discussed in my ElixirConf EU presentation in May 2022.

The elephant in the room: types

Throughout the years, the Elixir Core Team has addressed the biggest needs of the community. Elixir v1.6 introduced the Elixir code formatter, as the growing community and large teams saw an increased need for style guides and conventions around large codebases.

Elixir v1.9 shipped with built-in support for releases: self-contained archives that consist of your application code, all of its dependencies, plus the whole Erlang Virtual Machine (VM) and runtime. The goal was to address the perceived difficulty in deploying Elixir projects, by bringing tried approaches from both Elixir and Erlang communities into the official tooling. This paved the way to future automation, such as mix phx.gen.release, which automatically generates a Dockerfile tailored to your Phoenix applications.

Given our relationship with the community, it would be disingenuous to talk about my future with Elixir without addressing what seems to be the biggest community need nowadays: static typing. However, when the community asks for static typing, what are we effectively expecting? And what is the Elixir community to gain from it?

Types and Elixir

Different programming languages and platforms extract different values from types. These values may or may not apply to Elixir.

For example, different languages can extract performance benefits from types. However, Elixir still runs on the Erlang VM, which is dynamically typed, so we should not expect any meaningful performance gain from typing Elixir code.

Another benefit of types is to aid documentation (emphasis on the word aid as I don’t believe types replace textual documentation). Elixir already reaps similar benefits from typespecs and I would expect an integrated type system to be even more valuable in this area.
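
For reference, this is roughly what that looks like today, with a typespec sitting next to the docs (the function below is a made-up example, not code from Elixir itself):

@doc """
Doubles an integer.
"""
@spec double(integer()) :: integer()
def double(x) when is_integer(x), do: x * 2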

However, the upsides and downsides of static typing become fuzzier and prone to exaggeration once we discuss them in the context of code maintenance, in particular when comparing types with other software verification techniques, such as tests. In those situations, it is common to hear unrealistic claims such as “a static type system would catch 80% of my Elixir bugs” or “you need to write fewer tests once you have static types”.

While I explore in the keynote why I don’t believe those claims are true, saying that a static type system helps catch bugs is not helpful unless we discuss exactly which kinds of bugs it is supposed to identify, and that’s what we should focus on.

For example, Rust’s type system helps prevent bugs such as deallocating memory twice, dangling pointers, data races in threads, and more. But adding such a type system to Elixir would be unproductive because those are not bugs we run into in the first place: those properties are already guaranteed by the garbage collector and the Erlang runtime.

This brings another discussion point: a type system naturally restricts the amount of code we can write because, in order to prove certain properties about our code, certain styles have to be rejected. However, I would prefer to avoid restricting the expressive power of Elixir, because I am honestly quite happy with the language semantics (which we mostly inherited from Erlang).

For Elixir, the benefit of a type system would revolve mostly around contracts. If a function caller(arg) calls a function callee(arg), we want to guarantee that, as both functions change over time, caller keeps passing valid arguments into callee and properly handles the return types of callee.

This may seem like a simple guarantee to provide, but we’d run into tricky scenarios even on small code samples. For example, imagine we define a negate function that negates numbers. One may implement it like this:

def negate(x) when is_integer(x), do: -x

We could then say negate has the type integer() -> integer().

With our custom negation in hand, we can implement a custom subtraction:

def subtract(a, b) when is_integer(a) and is_integer(b) do
  a + negate(b)
end

This would all work and typecheck as expected, as we are only working with integers. However, imagine in the future someone decides to make negate polymorphic, so it also negates booleans:

def negate(x) when is_integer(x), do: -x
def negate(x) when is_boolean(x), do: not x

If we were to naively say that negate now has the type integer() | boolean() -> integer() | boolean(), we would now get a false positive warning in our implementation of subtract:

Type warning:

  |
  |  def subtract(a, b) when is_integer(a) and is_integer(b) do
  |    a + negate(b)
         ^ the operator + expects integer(), integer() as arguments,
           but the second argument can be integer() | boolean()

So we want a type system that can type contracts between functions but, at the same time, avoids false positives and does not restrict the Elixir language. Balancing those trade-offs is not only a technical challenge but also one that needs to consider the needs of the community. The Dialyzer project, implemented in Erlang and available for Elixir projects, chose to have no false positives. However, that implies certain bugs may not be caught.

At this point in time, it seems the overall community would prefer a system that flags more potential bugs, even if it means more false positives. This may be particularly tricky in the context of Elixir and Erlang because I like to describe them as assertive languages: we write code that will crash in face of unexpected scenarios because we rely on supervisors to restart parts of our application whenever that happens. This is the foundation of building self-healing and fault-tolerant systems in those languages.

On the other hand, this is what makes a type system for Erlang/Elixir so exciting and unique: the ability to deal with failure modes both at compile-time and runtime elegantly. Because at the end of the day, regardless of the type system of your choice, you will run into unexpected scenarios, especially when interacting with external resources such as the filesystem, APIs, distributed nodes, etc.

The big announcement

This brings me to the big announcement from ElixirConf EU 2022: we have an ongoing PhD scholarship to research and develop a type system for Elixir based on set-theoretic types. Guillaume Duboc (PhD student) is the recipient of the scholarship, led by Giuseppe Castagna (Senior Researcher) with support from José Valim (that’s me).

The scholarship is a partnership between the CNRS and Remote. It is sponsored by Supabase (they are hiring!), Fresha (they are hiring!), and Dashbit, all heavily invested in Elixir’s future.

Why set-theoretic types?

We want a type system that can elegantly model all Elixir idioms and, at first glance, set-theoretic types were an excellent match. In set-theoretic types, we use set operations to define types and ensure that the types satisfy the associativity and distributivity properties of the corresponding set-theoretic operations.

For example, numbers in Elixir can be integers or floats, therefore we can write them as the union integer() | float() (which is equivalent to float() | integer()).

Remember the negate function we wrote above?

def negate(x) when is_integer(x), do: -x
def negate(x) when is_boolean(x), do: not x

We could think of it as a function that has both types (integer() -> integer()) and (boolean() -> boolean()), which is an intersection. This would naturally solve the problem described in the previous section: when called with an integer, it can only return an integer.

We also have a data structure called atoms in Elixir. They uniquely represent a value given by their own name, such as :sunday or :banana. You can think of the type atom() as the set of all atoms. In addition, we can think of the values :sunday and :banana as subtypes of atom(), as they are contained in the set of all atoms. :sunday and :banana are also known as singleton types (as they are made up of only one value).
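
As a small illustration in plain Elixir (this is today’s code, not new syntax from the research), singleton types line up with how we already match on specific atoms:

# :saturday and :sunday behave like singleton types here: each of the
# first two clauses accepts exactly one atom, while the last clause
# accepts any value in atom().
def weekend?(:saturday), do: true
def weekend?(:sunday), do: true
def weekend?(day) when is_atom(day), do: false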

In fact, we could even consider each integer to be a singleton type that belongs to the integer() set. The choice of which values become singletons in our type system will strongly depend on the trade-offs we outlined in the previous sections.

Furthermore, the type system has to be gradual, as any typed Elixir code would have to interact with untyped Elixir code.

Personally, I find set-theoretic types an elegant and accessible way to reason about types. At the end of the day, an Elixir developer won’t have to think about intersections when writing a function with multiple clauses, but the modelling is straightforward if they ever look under the hood.

Despite the initial fit between Elixir semantics and set-theoretic types, there are open questions and existing challenges in putting the two together. Here are some examples:

  • Elixir has an expressive collection of idioms used in pattern matching and guards. Can we map them all to set-theoretic types?

  • Elixir associative data structures, called maps, can be used both as records and as dictionaries. Would it be possible to also type them with a unified foundation?

  • Gradual type systems must introduce runtime type checks in order to remain sound. However, those type checks will happen in addition to the checks already done by the Erlang VM, which can degrade performance. Therefore, is it possible to leverage the existing runtime checks done by the Erlang VM so the resulting type system is still sound?

Those challenges are precisely what makes me excited to work with Giuseppe Castagna and Guillaume Duboc, as we believe it is important to formalize those problems and their solutions, before we dig deep into the implementation. To get started with set-theoretic types, I recommend Programming with union, intersection, and negation types by Giuseppe Castagna.

Finally, it is important to note there are areas we don’t plan to tackle at the moment, such as typing of messages between processes.

Expectations and roadmap

At this point, you may be expecting that Elixir will certainly become a gradually typed language at some moment in its future. However, it is important to note this may not be the case, as there is a long road ahead of us.

One of the challenges in implementing a type system - at least for someone who doesn’t have the relevant academic background like myself - is that it feels like a single indivisible step: you take a language without a type system and at the end you have one, without much insight or opportunity for feedback in the middle. Therefore, we have been planning to incorporate the type system into Elixir in steps, which I have been referring to as “a gradual gradual type system”: one where we add gradual types to the language gradually.

The first step, the one we are currently working on, is to leverage the existing type information found in Elixir programs. As previously mentioned, we write assertive code in Elixir, which means there is a lot of type information in patterns and guards. We want to lift this information and use it to type check existing codebases. The Erlang compiler already does so to improve performance within a single module and we want to eventually do so across modules and applications too.
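
To make that concrete, here is a hedged sketch of the kind of mismatch such inference could flag without a single type annotation (the module and the would-be warning are hypothetical):

defmodule Example do
  # The pattern already tells us this function expects a map with a :name key
  def greet(%{name: name}), do: "Hello, #{name}!"

  def run do
    # No clause of greet/1 can ever match a tuple, so a checker that lifts
    # type information from patterns could report this call at compile time.
    greet({"Jane", "Doe"})
  end
end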

During this phase, Elixir developers won’t have to change a single line of code to leverage the benefits of the type system. Of course, we will catch only part of the existing bugs, but this will allow us to stress test, benchmark, and collect feedback from developers, making improvements behind the scenes (or even revert the whole thing if we believe it won’t lead us where we expect).

The next step is to introduce typed structs into the language, allowing struct types to propagate throughout the system as you pattern match on structs across the codebase. In this stage we will introduce a new API for defining structs, yet to be discussed, and developers will have to use the new API to reap its benefits.

Then finally, once we are happy with the improvements and the feedback collected, we can introduce a new syntax for typing function signatures in Elixir codebases, including support for more advanced features such as polymorphic types. Those will allow us to type complex constructs such as the ones found in the Enum module.

The important point to keep in mind is that those features will be explored and developed in steps, with plenty of opportunity to gather community feedback. I also hope our experience may be useful to other ecosystems that wish to gradually introduce type systems into existing programming languages, in a way that feels granular and participative.

Thank you for reading and see you in a future article of the “My Future with Elixir” series.


Erlang/OTP 25.1 Release

OTP 25.1

Erlang/OTP 25.1 is the first maintenance patch package for OTP 25, with mostly bug fixes as well as many small improvements.

Below are some highlights of the release:

crypto:

  • Crypto is now considered to be usable with the OpenSSL 3.0 cryptolib for production code. ENGINE and FIPS are not yet fully functional.

  • Changed the behaviour of the engine load/unload functions

ssl:

  • A vulnerability has been discovered and corrected. It is registered as CVE-2022-37026 “Client Authentication Bypass”. Corrections have been released on the supported tracks with patches 23.3.4.15, 24.3.4.2, and 25.0.2. The vulnerability might also exist in older OTP versions. We recommend that impacted users upgrade to one of these versions or later on the respective tracks. OTP 25.1 would be an even better choice. Impacted are those who are running an ssl/tls/dtls server using the ssl application either directly or indirectly via other applications, for example via inets (httpd), cowboy, etc. Note that the vulnerability only affects servers that request client certification, that is, servers that set the option {verify, verify_peer}.

The Erlang/OTP source can also be found at GitHub on the official Erlang repository, https://github.com/erlang/otp

Download links for this and previous versions are found here


Elixir v1.14 released

Elixir v1.14 has just been released. 🎉

Let’s check out new features in this release. Like many of the past Elixir releases, this one has a strong focus on developer experience and developer happiness, through improvements to debugging, new debugging tools, and improvements to term inspection. Let’s take a quick tour.

Note: the original announcement embeds asciinema snippets. Plain links to the recordings are included below.

dbg

Kernel.dbg/2 is a new macro that’s somewhat similar to IO.inspect/2, but specifically tailored for debugging.

When called, it prints the value of whatever you pass to it, plus the debugged code itself as well as its location.

See the example in asciinema: https://asciinema.org/a/510632
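
Here is a rough idea of what a call looks like (the printed output below is approximate and the file name is made up):

name = "Elixir"
dbg(String.downcase(name))
# Prints something along the lines of:
#   [my_script.exs:2: (file)]
#   String.downcase(name) #=> "elixir"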

dbg/2 can do more. It’s a macro, so it understands Elixir code. You can see that when you pass a series of |> pipes to it: dbg/2 will print the value for every step of the pipeline.

See the example in asciinema: https://asciinema.org/a/509506
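
In the spirit of the recorded example, a pipeline passed to dbg/2 might look like this (output shape approximate):

"Elixir is cool!"
|> String.trim_trailing("!")
|> String.split()
|> dbg()
# Prints the code and value of each step, roughly:
#   "Elixir is cool!"
#   |> String.trim_trailing("!") #=> "Elixir is cool"
#   |> String.split() #=> ["Elixir", "is", "cool"]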

IEx + dbg

Interactive Elixir (IEx) is Elixir’s shell (also known as REPL). In v1.14, we have improved IEx breakpoints to also allow line-by-line stepping:

See the example in asciinema: https://asciinema.org/a/508048

We have also gone one step further and integrated this new functionality with dbg/2.

dbg/2 supports configurable backends. IEx automatically replaces the default backend with one that halts the code execution with IEx:

See the example in asciinema: https://asciinema.org/a/509507

We call this process “prying”, as you get access to variables and imports, but without the ability to change how the code actually executes.

This also works with pipelines: if you pass a series of |> pipe calls to dbg (or pipe into it at the end, like |> dbg()), you’ll be able to step through every line in the pipeline.

See the example in asciinema: https://asciinema.org/a/509509

You can keep the default behavior by passing the --no-pry option to IEx.

dbg in Livebook

Livebook brings the power of computation notebooks to Elixir.

As another example of the power behind dbg, the Livebook team has implemented a visual representation for dbg as a backend, where each step of the pipeline is rendered as its own distinct UI element. You can select an element to see its output or even re-order and disable sections of the pipeline on the fly.

PartitionSupervisor

PartitionSupervisor implements a new supervisor type. It is designed to help when you have a single supervised process that becomes a bottleneck. If that process’ state can be easily partitioned, then you can use PartitionSupervisor to supervise multiple isolated copies of that process running concurrently, each assigned its own partition.

For example, imagine you have an ErrorReporter process that you use to report errors to a monitoring service.

# Application supervisor:
children = [
  # ...,
  ErrorReporter
]

Supervisor.start_link(children, strategy: :one_for_one)

As the concurrency of your application goes up, the ErrorReporter process might receive requests from many other processes and eventually become a bottleneck. In a case like this, it could help to spin up multiple copies of the ErrorReporter process under a PartitionSupervisor.

# Application supervisor
children = [
  {PartitionSupervisor, child_spec: ErrorReporter, name: Reporters}
]

The PartitionSupervisor will spin up a number of processes equal to System.schedulers_online() by default (most often one per core). Now, when routing requests to ErrorReporter processes we can use a :via tuple and route the requests through the partition supervisor.

partitioning_key = self()
ErrorReporter.report({:via, PartitionSupervisor, {Reporters, partitioning_key}}, error)

Using self() as the partitioning key here means that the same process will always report errors to the same ErrorReporter process, ensuring a form of back-pressure. You can use any term as the partitioning key.
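
Just to make the routing concrete, here is a minimal sketch of what such an ErrorReporter might look like, assuming it is a plain GenServer (this module is illustrative, not part of the release):

defmodule ErrorReporter do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, opts)
  end

  # `server` can be a pid, a registered name, or the :via tuple shown above
  def report(server, error) do
    GenServer.cast(server, {:report, error})
  end

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_cast({:report, error}, state) do
    # placeholder: this is where the error would be sent to the monitoring service
    IO.inspect(error, label: "reporting error")
    {:noreply, state}
  end
end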

A common example

A common and practical example of a good use case for PartitionSupervisor is partitioning something like a DynamicSupervisor. When starting many processes under it, a dynamic supervisor can be a bottleneck, especially if said processes take a long time to initialize. Instead of starting a single DynamicSupervisor, you can start multiple:

children = [
  {PartitionSupervisor, child_spec: DynamicSupervisor, name: MyApp.DynamicSupervisors}
]

Supervisor.start_link(children, strategy: :one_for_one)

Now you start processes on the dynamic supervisor for the right partition. For instance, you can partition by PID, like in the previous example:

DynamicSupervisor.start_child(
  {:via, PartitionSupervisor, {MyApp.DynamicSupervisors, self()}},
  my_child_specification
)

Improved errors on binaries and evaluation

Erlang/OTP 25 improved errors on binary construction and evaluation. These improvements apply to Elixir as well. Before v1.14, errors when constructing binaries would often be hard-to-debug, generic “argument errors”. Erlang/OTP 25 and Elixir v1.14 provide more detail for easier debugging. This work is part of EEP 54.

Before:

int = 1
bin = "foo"
int <> bin
#=> ** (ArgumentError) argument error

Now:

int = 1
bin = "foo"
int <> bin
#=> ** (ArgumentError) construction of binary failed:
#=>    segment 1 of type 'binary':
#=>    expected a binary but got: 1

Code evaluation (in IEx and Livebook) has also been improved to provide better error reports and stacktraces.

Slicing with Steps

Elixir v1.12 introduced stepped ranges, which are ranges where you can specify the “step”:

Enum.to_list(1..10//3)
#=> [1, 4, 7, 10]

Stepped ranges are particularly useful for numerical operations involving vectors and matrices (see Nx, for example). However, the Elixir standard library was not making use of stepped ranges in its APIs. Elixir v1.14 starts to take advantage of steps with support for stepped ranges in a couple of functions. One of them is Enum.slice/2:

letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
Enum.slice(letters, 0..5//2)
#=> ["a", "c", "e"]

binary_slice/2 (and binary_slice/3 for completeness) have been added to the Kernel module. They work with bytes and also support stepped ranges:

binary_slice("Elixir", 1..5//2)
#=> "lxr"

Expression-based Inspection and Inspect Improvements

In Elixir, it’s conventional to implement the Inspect protocol for opaque structs so that they’re inspected with a special notation, resembling this:

MapSet.new([:apple, :banana])
#=> #MapSet<[:apple, :banana]>

This is generally done when the struct content or part of it is private and the %name{...} representation would reveal fields that are not part of the public API.

The downside of the #name<...> convention is that the inspected output is not valid Elixir code. For example, you cannot copy the inspected output and paste it into an IEx session.

Elixir v1.14 changes the convention for some of the standard-library structs. The Inspect implementation for those structs now returns a string with a valid Elixir expression that recreates the struct when evaluated. In the MapSet example above, this is what we have now:

fruits = MapSet.new([:apple, :banana])
MapSet.put(fruits, :pear)
#=> MapSet.new([:apple, :banana, :pear])

The MapSet.new/1 expression evaluates to exactly the struct that we’re inspecting. This allows us to hide the internals of MapSet, while keeping it as valid Elixir code. This expression-based inspection has been implemented for Version.Requirement, MapSet, and Date.Range.
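
The same now applies to Date.Range; a quick check in IEx (hedged, based on my reading of the change):

Date.range(~D[2001-01-01], ~D[2001-01-03])
#=> Date.range(~D[2001-01-01], ~D[2001-01-03])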

Finally, we have improved the Inspect protocol for structs so that fields are inspected in the order they are declared in defstruct. The option :optional has also been added when deriving the Inspect protocol, giving developers more control over the struct representation. See the updated documentation for Inspect for a general rundown on the approaches and options available.
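
As a hedged sketch of the new :optional option (the struct and fields below are made up, and the exact rendering may differ slightly):

defmodule Point do
  # :label is meant to be omitted from the inspected output
  # while it still holds its default value
  @derive {Inspect, optional: [:label]}
  defstruct [:x, :y, :label]
end

inspect(%Point{x: 1, y: 2})
# expected to print something like %Point{x: 1, y: 2}, without :label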

Learn more

For a complete list of all changes, see the full release notes.

Check the Install section to get Elixir installed and read our Getting Started guide to learn more.

Happy debugging!


Debugging a Slow Starting Elixir Application

I recently had to fix an Elixir service that was slow to start. I was able to pinpoint the issue with only a few commands and I want to share a couple of the things I learned.

Dependencies

In Elixir, all dependencies are “applications”. The term “application” means something different than it does outside of Elixir: an “application” is a set of modules and behaviors. Some of these applications define their own supervision trees and must be started with Application.start/2 before they can be used. When you start your Elixir service, either via Mix or a generated Elixir release, the dependencies you specified in your mix.exs file are started before your own code. If an application listed as a dependency is slow to start, your application must wait until that dependency is running before it can start.
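
You can poke at these concepts from an IEx shell; a small hedged example (the exact dependency list varies by application and version):

# Which applications does :logger depend on?
Application.spec(:logger, :applications)
#=> [:kernel, :elixir]  # illustrative; the exact list may differ

# Starting an application that is already running returns an error tuple
Application.start(:logger)
#=> {:error, {:already_started, :logger}}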

While the behavior is simple, it is recursive. Each application has its own set of dependencies that must be running before that application can be started, and some of those dependencies have dependencies of their own that must be running before they can start. This results in a dependency tree. To illustrate with a little ASCII:

- your_app
  - dependency_1
    - hidden_dependency_1
    - hidden_dependency_2
  - dependency_2
    - hidden_dependency_3

For this application, the Erlang VM would likely start these applications in this order:

  1. hidden_dependency_3

  2. dependency_2

  3. hidden_dependency_2

  4. hidden_dependency_1

  5. dependency_1

  6. your_app

The application I had to fix had a lot of dependencies. Profiling each application would be tedious and time-consuming, and I had a hunch there was probably a single dependency that was the problem. Turns out it’s pretty easy to write a little code that times the start up of each application.

Profiling

Start an IEx shell with the --no-start flag so that the application is available but not yet loaded or started:

iex -S mix run --no-start

Then load this code into the shell:

defmodule StartupBenchmark do
  def run(application) do
    complete_deps = deps_list(application) # (1)

    dep_start_times = Enum.map(complete_deps, fn(app) -> # (2)
      case :timer.tc(fn() -> Application.start(app) end) do
        {time, :ok} -> {time, app}
        # Some dependencies like :kernel may have already been started; we can ignore them
        {time, {:error, {:already_started, _}}} -> {time, app}
        # Raise an exception if we get a non-successful return value
        {time, error} -> raise "failed to start #{app}: #{inspect(error)}"
      end
    end)

    dep_start_times
    |> Enum.sort() # (3)
    |> Enum.reverse()
  end

  defp deps_list(app) do
    # Get all dependencies for the app
    deps = Application.spec(app, :applications)

    # Recursively call to get all sub-dependencies
    complete_deps = Enum.map(deps, fn(dep) -> deps_list(dep) end)

    # Build a complete list of sub dependencies, with the top level application
    # requiring them listed last, also remove any duplicates
    [complete_deps, [app]]
    |> List.flatten()
    |> Enum.uniq()
  end
end

To highlight the important pieces from this module:

  1. Recursively get all applications that must be started in the order they need to be started in.

  2. Start each application in order, timing each one.

  3. Sort applications by start time so the slowest application is the first item in the list.

With this code, finding applications that are slow to start is easy:

> StartupBenchmark.run(:your_app)
[
  {6651969, :prometheus_ecto},
  {19621, :plug_cowboy},
  {14336, :postgrex},
  {13598, :ecto_sql},
  {5123, :yaml_elixir},
  {3871, :phoenix_live_dashboard},
  {1159, :phoenix_ecto},
  {123, :prometheus_plugs},
  {64, :ex_json_logger},
  {56, :prometheus_phoenix},
  {56, :ex_ops},
  {36, :kernel},
  ...
]

These times are in microseconds, so in this case prometheus_ecto is taking 6.6 seconds to start. All other applications take less than 20 milliseconds, and many of them take less than 1 millisecond. prometheus_ecto is the culprit here.

Conclusion

With the code above I was able to identify prometheus_ecto as the problem. With this information I was then able to use eFlambe and a few other tools to figure out why prometheus_ecto was so slow and quickly fix the issue.

I hope the snippet of code above will be helpful to some of you. If you like reading my blog posts please subscribe to my newsletter. I send emails out once a month with my latest posts.


My favorite Erlang Container

2022/07/09


Joe Armstrong wrote a blog post titled My favorite Erlang Program, which showed a very simple universal server written in Erlang:

universal_server() ->
    receive
        {become, F} ->
            F()
    end.

You could then write a small program that could fit this function F:

factorial_server() ->
    receive
        {From, N} ->
            From ! factorial(N),
            factorial_server()
    end.

factorial(0) -> 1;
factorial(N) -> N * factorial(N-1).

If you had an already running universal server, such as you would by having called Pid = spawn(fun universal_server/0), you could then turn that universal server into the factorial server by calling Pid ! {become, fun factorial_server/0}.

Weeds growing in driveway cracks

Joe Armstrong had a way to get to the essence of a lot of concepts and to think about programming as a fun thing. Unfortunately for me, my experience with the software industry has left me more or less frustrated with the way things are, even if the way things are is for very good reasons. I really enjoyed programming Erlang professionally, but I eventually got sidetracked by other factors that would lead to solid, safe software, mostly higher-level aspects of socio-technical systems, and I became an SRE.

But a part of me still really likes dreaming about the days where I could do hot code loading over entire clusters—see A Pipeline Made of Airbags—and I kept thinking about how I could bring this back, but within the context of complex software teams running CI/CD, building containers, and running them in Kubernetes. This is no short order, because we now have decades of lessons telling everyone that you want your infrastructure to be immutable and declared in code.

I also have a decade of experience telling me a lot of what we've built is a frustrating tower of abstractions over shaky ground. I know I've experienced better, my day job is no longer about slinging my own code, and I have no pretense of respecting the tower of abstraction itself.

A weed is a plant considered undesirable in a particular situation, "a plant in the wrong place". Examples commonly are plants unwanted in human-controlled settings, such as farm fields, gardens, lawns, and parks. Taxonomically, the term "weed" has no botanical significance, because a plant that is a weed in one context is not a weed when growing in a situation where it is wanted.

Like the weeds that decide that the tiniest crack in a driveway is actually real cool soil to grow in, I've decided to do the best thing given the situation and bring back Erlang live code upgrades to modern CI/CD, containerized and kubernetized infrastructure.

If you want the TL;DR: I wrote the dandelion project, which shows how to take an Erlang/OTP app, automate the generation of live code upgrade instructions with the help of pre-existing tools and CI/CD, generate a manifest file and store build artifacts, and write the necessary configuration to have Kubernetes run said containers and do automated live code upgrades despite its best attempts at providing immutable images. Then I pepper in some basic CI scaffolding to make live code upgrading a tiny bit less risky. This post describes how it all works.

A Sample App

A lot of "no downtime" deployments you'll find for Kubernetes are actually just rolling updates with graceful connection termination. Those are always worth supporting (even if they can be annoying to get right), but they rely on a narrower definition of downtime than what we're aiming for here: no need to restart the application, no dumping and re-hydrating of state, and not a single dropped connection.

A small server with persistent connections

I wrote a really trivial application, nothing worth calling home about. It's a tiny TCP server with a bunch of acceptors where you can connect with netcat (nc localhost 8080) and it just displays stuff. This is the bare minimum to show actual "no downtime": a single instance, changing its code definitions in a way that is directly observable to a client, without ever dropping a connection.

The application follows a standard release structure for Erlang, using Rebar3 as a build tool. Its supervision structure looks like this:

               top level
              supervisor
               /      \
        connection   acceptors
        supervisor   supervisor
            |            |
        connection    acceptor
         workers        pool
       (1 per conn)

The acceptors supervisor starts a TCP listen socket, which is passed to each worker in the acceptor pool. Upon each accepted connection, a connection worker is started and handed the final socket. The connection worker then sits in an event loop. Every second it sends in a small ASCII drawing of a dandelion, and for every packet of data it receives (coalesced or otherwise), it sends in a line containing its version.

A netcat session looks like this:

$ nc localhost 8080

   @
 \ |
__\!/__


   @
 \ |
__\!/__

> ping!
vsn: 0.1.5

   @
 \ |
__\!/__

^C

Universal container: a plan

Modern deployments are often done with containers and Kubernetes. I'm assuming you're familiar with both of these concepts, but if you want more information—in the context of Erlang—then Tristan Sloughter wrote a great article on Docker and Erlang and another one on using Kubernetes for production apps.

In this post, I'm interested in doing two things:

  1. Have an equivalent to Joe Armstrong's Universal server
  2. Force the immutable world to nevertheless let me do live code upgrades

The trick here is deceptively simple, enough to think "that can't be a good idea." It probably isn't.

A tracing of Nathan Fielder

The plan? Use a regular container someone else maintains and just wedge my program's tarball in there. I can then use a sidecar to automate fetching updates and applying live code upgrades without Kubernetes knowing anything about it.

Erlang Releases: a detour

To understand how this can work, we first need to cover the basics of Erlang releases. A good overview of the Erlang Virtual Machine's structure is an article I've written, OTP at a high level, but I can summarize it here by describing the following layers, from lowest to highest:

  • There is an Erlang Runtime System (ERTS), which is essentially the VM itself, written in C. This can't be live upgraded, and offers features around Erlang's immutability, preemptive scheduling, memory allocation, garbage collection, and so on.
  • A few pre-loaded modules offer core functionality around files and sockets. I don't believe these get to be live upgraded either.
  • There is a pair of libraries ("applications" in Erlang-speak) called kernel and stdlib that define the most basic libraries around list processing, distributed programs, and the core of "OTP", the general development framework in the language. These can be live upgraded, but pretty much nobody does that.
  • Then we have the Erlang standard library. This includes things such as TLS support, HTTP clients and servers, Wx bindings, test frameworks, the compiler, and extra scaffolding around OTP niceties, to name a few.

Your own project is pretty much just your own applications ("libraries"), bundled with select standard library applications and a copy of the Erlang system:

release schematic drawing

The end result of this sort of ordeal is that every Erlang project is pretty much people writing their own libraries (in blue in the drawing above), fetching bits from the Erlang base install (in red in the drawing above), and then using tools (such as Rebar3) to repackage everything into a brand new Erlang distribution. A detailed explanation of how this happens is also in Adopting Erlang.

Whatever you built on one system can be deployed on an equivalent system. If you built your app on a 64 bit linux—and assuming you used static libraries for OpenSSL or LibreSSL, or have equivalent ones on a target system—then you can pack a tarball, unpack it on the other host system, and get going. Those requirements don't apply to your code, only the standard library. If you don't use NIFs or other C extensions, your own Erlang code, once built, is fully portable.

The cool thing is that Erlang supports making a sort of "partial" release, where you take the Erlang/OTP part (red) and your own apps (blue) in the above image, package only your own apps along with a sort of "figure out the Erlang/OTP part at run-time" instruction, and your application is then entirely portable across all platforms (Windows, Linux, BSD, macOS) and supported architectures (x86, ARM32, ARM64, etc.)

I'm mentioning this because for the sake of this experiment, I'm running things locally on a M1 macbook air (ARM64), with MicroK8s (which runs a Linux/aarch64), but am using Github Actions, which are on an x86 linux. So rather than using a base ubuntu image and then needing to run the same sort of hardware family everywhere down the build chain, I'll be using an Erlang image from dockerhub to provide the ERTS and stdlib, and will then have the ability to make portable builds from either my laptop or github actions and deploy them onto any Kubernetes cluster—something noticeably nicer than having to deal with cross-compilation in any language.

Controlling Releases

The release definition for Dandelion accounts for the above factors, and looks like this:

{relx, [
    {release, {dandelion, "0.1.5"},   % my release and its version (dandelion-0.1.5)
     [dandelion,                      % includes the 'dandelion' app and its deps
      sasl]},                         % and the 'sasl' library, which is needed for
                                      % live code upgrades to work

    %% set runtime configuration values based on the environment in this file
    {sys_config_src, "./config/sys.config.src"},
    %% set VM options in this file
    {vm_args, "./config/vm.args"},

    %% drop source files and the ERTS, but keep debug annotations
    %% which are useful for various tools, including automated
    %% live code upgrade plugins
    {include_src, false},
    {include_erts, false},
    {debug_info, keep},
    {dev_mode, false}
]}.

The release can be built by calling rebar3 release and packaged by calling rebar3 tar.

Take the resulting tarball, unpack it, and you'll get a bunch of directories: lib/ contains the build artifact for all libraries, releases/ contains metadata about the current release version (and the structure to store future and past versions when doing live code upgrades), and finally the bin/ directory contains a bunch of accessory scripts to load and run the final code.

Call bin/dandelion and a bunch of options show up:

$ bin/dandelion
Usage: dandelion [COMMAND] [ARGS]

Commands:

  foreground              Start release with output to stdout
  remote_console          Connect remote shell to running node
  rpc [Mod [Fun [Args]]]  Run apply(Mod, Fun, Args) on running node
  eval [Exprs]            Run expressions on running node
  stop                    Stop the running node
  restart                 Restart the applications but not the VM
  reboot                  Reboot the entire VM
...
  upgrade [Version]       Upgrade the running release to a new version
  downgrade [Version]     Downgrade the running release to a new version
  install [Version]       Install a release
  uninstall [Version]     Uninstall a release
  unpack [Version]        Unpack a release tarball
  versions                Print versions of the release available
...

So in short, your program's lifecycle can become:

  • bin/dandelion foreground boots the app (use bin/dandelion console to boot a version in a REPL)
  • bin/dandelion remote_console pops up a REPL onto the running app by using distributed Erlang

If you're doing the usual immutable infrastructure, that's it, you don't need much more. If you're doing live code upgrades, you then have a few extra steps:

  1. Write a new version of the app
  2. Give it some instructions about how to do its live code upgrade
  3. Pack that in a new version of the release
  4. Put the tarball in releases/
  5. Call bin/dandelion unpack <version> and the Erlang VM will unpack the new tarball into its regular structure
  6. Call bin/dandelion install <version> to get the Erlang VM in your release to start tracking the new version (without switching to it)
  7. Call bin/dandelion upgrade <version> to apply the live code upgrade

And from that point on, the new release version is live.

Hot Code Upgrade Instructions

I've sort of papered over the complexity required to "give it some instructions about how to do its live code upgrade." This area is generally really annoying and complex. You first start with appup files, which contain instructions on upgrading individual libraries, and which are then packaged into a relup that provides instructions for coordinating the overall upgrade.

If you're running live code upgrades on a frequent basis you may want to get familiar with these, but most people never bothered, and the vast majority of live code upgrades are done by people writing manual scripts to load specific modules.

A very nice solution that also exists is Luis Rascão's rebar3_appup_plugin, which will take two releases, compare their code, and auto-generate instructions on your behalf. By using it, most of the annoyances and challenges are automatically covered for you.

All you need to do is make sure all versions are adequately bumped, do a few command line invocations, and package it up. This will be a prime candidate for automation later in this post.

For now though, let's assume we'll just put the release in an S3 bucket that the kubernetes cluster has access to, and build our infrastructure on the Kubernetes side.

Universal container: a kubernetes story

Let's escape the Erlang complexity and don our DevOps hat. We now want to run the code we assume has made it safely to S3. All of it holds neatly in a single YAML file—which, granted, can't really be beautiful on its own. I use three containers in a single Kubernetes pod:

containers schematic drawing

All of these containers will share a 'release' directory, by using an EmptyDir volume. The bootstrap container will fetch the latest release and unpack it there, the dandelion-release container will run it, and the sidecar will be able to interact over the network to manage live code upgrades.

The bootstrap container runs first and fetches the first (and current) release from S3. I'm doing so by assuming we'll have a manifest file (<my-s3-bucket>/dandelion-latest) that contains a single version number pointing to the tarball I want (<my-s3-bucket>/dandelion-<version>.tar.gz). This can be done with a shell script:

#!/usr/bin/env bash
set -euxo pipefail
RELDIR=${1:-/release}
S3_URL="https://${BUCKET_NAME}.s3.${AWS_REGION}.amazonaws.com"
TAG=$(curl "${S3_URL}/${RELEASE}-latest" -s)
wget -nv "${S3_URL}/${RELEASE}-${TAG}.tar.gz" -O "/tmp/${RELEASE}-${TAG}.tar.gz"
tar -xvf "/tmp/${RELEASE}-${TAG}.tar.gz" -C ${RELDIR}
rm "/tmp/${RELEASE}-${TAG}.tar.gz"

This fetches the manifest, grabs the tag, fetches the release, unpacks it, and deletes the old tarball. The dandelion-release container, which will run our main app, can then just call the bin/dandelion script directly:

#!/usr/bin/env bash
set -euxo pipefail
RELDIR=${1:-/release}
exec ${RELDIR}/bin/${RELEASE} foreground

The sidecar is a bit more tricky, but can reuse the same mechanisms. Every time interval (or based on a feature flag or some server-sent signal), check the manifest, and apply the unpacking steps. Something a bit like:

#!/usr/bin/env bash
set -euxo pipefail
RELDIR=${2:-/release}
S3_URL="https://${BUCKET_NAME}.s3.${AWS_REGION}.amazonaws.com"
CURRENT=$(${RELDIR}/bin/${RELEASE} versions | awk '$3=="permanent" && !vsn { vsn=$2 } $3=="current" { vsn=$2 } END { print vsn }')
TAG=$(curl "${S3_URL}/${RELEASE}-latest" -s)
if [[ "${CURRENT}" != "${TAG}" ]]; then
    wget -nv "${S3_URL}/${RELEASE}-${TAG}.tar.gz" -O "${RELDIR}/releases/${RELEASE}-${TAG}.tar.gz"
    ${RELDIR}/bin/${RELEASE} unpack ${TAG}
    ${RELDIR}/bin/${RELEASE} install ${TAG}
    ${RELDIR}/bin/${RELEASE} upgrade ${TAG}
fi

Call this in a loop and you're good to go.

Now here's the fun bit: ConfigMaps are a Kubernetes feature that lets you take arbitrary metadata and optionally mount it as files inside pods. This is how we get close to our universal container.

By declaring the three scripts above as a ConfigMap and mounting them in a /scripts directory, we can then declare the 3 containers in a generic fashion:

initContainers:
  - name: dandelion-bootstrap
    image: erlang:25.0.2
    env:
      - ...
    volumeMounts:
      - name: release
        mountPath: /release
      - name: scripts
        mountPath: /scripts
    command:
      - /scripts/init-latest.sh
# Regular containers run next
containers:
  - name: dandelion-release
    image: erlang:25.0.2
    env:
      - ...
    volumeMounts:
      - name: release
        mountPath: /release
      - name: scripts
        mountPath: /scripts
    command:
      - /scripts/boot-release.sh
    ports:
      - containerPort: 8080
        hostPort: 8080
  - name: dandelion-sidecar
    image: erlang:25.0.2
    env:
      - ...
    volumeMounts:
      - name: release
        mountPath: /release
      - name: scripts
        mountPath: /scripts
    command:
      - /scripts/update-loop.sh

The full file has more details, but this is essentially all we need. You could kubectl apply -f dandelion.yaml and it would get going on its own. The rest is about providing a better developer experience.

Making it Usable

What we have defined now is an expected format and procedure from Erlang's side to generate code and live upgrade instructions, and a wedge to make this usable within Kubernetes' own structure. This procedure is somewhat messy, and there are a lot of technical aspects that need to be coordinated to make this usable.

Now comes the time to work on providing a useful workflow for this.

Introducing Smoothver

Semver's alright. Most of the time I won't really care about it, though. I'll go read the changelog and see if whatever I depend on has changed or not. People will pick versions for whichever factor they want, and they'll very often put a small breaking change (maybe a bug fix!) as non-breaking because there's an intent being communicated by the version.

Here the semver semantics are not useful. I've just defined a workflow that mostly depends on whether the server can be upgraded live or not, with some minor variations. This operational concern is likely to be the main concern of engineers who would work on such an application daily, particularly since as someone deploying and maintaining server-side software, I mostly own the whole pipeline and always consider the main branch to be canonical.

As such, I should feel free to develop my own versioning scheme. Since I'm trying to reorient Dandelion's whole flow towards continuous live delivery, my versioning scheme should actually reflect and support that effort. I therefore introduce Smoothver (Smooth Versioning):

Given a version number RESTART.RELUP.RELOAD, increment the:

  • RESTART version when you make a change that requires the server to be rebooted.
  • RELUP version when you make a change that requires pausing workers and migrating state.
  • RELOAD version when you make a change that requires reloading modules with no other transformation.

The version number now communicates the relative expected risk of a deployment in terms of disruptiveness, carries some meaning around the magnitude of change taking place, and can be leveraged by tooling.

For example:

  • Fixing a bug in a data structure is a RELOAD deploy so long as the internal representation does not change (e.g. swapping a > for a >= in a comparison)
  • Adding a new endpoint or route to an existing HTTP API is likely a RELOAD deploy since no existing state relies on it. Rolling it back is a business concern, not a technical one.
  • Adding or changing a field in a data structure representing a user is a RELUP operation, since rolling forward or backward implies a data transformation to remain compatible
  • Upgrading the VM version is a RESTART because the C code of the VM itself can't change
  • Bumping a stateful dependency that does not provide live code upgrades forces a RESTART version bump

As with anything we do, the version bump may be wrong. But it at least carries a certain safety level in letting you know that a RESTART live code upgrade should absolutely not be attempted.

Engineers who get more familiar with live code upgrades will also learn some interesting lessons. For example, a RELUP change over a process that has tens of thousands of copies of itself may take a long long time to run and be worse than a rolling upgrade. An interesting thing you can do then is turn RELUP changes (which would require calling code change instructions) into basic code reloads by pattern matching an old structure and converting it on each call, turning it into a somewhat stateless roll-forward affair.

That's essentially converting operational burdens into dirtier code, but this sort of thing is something you do all the time with database migrations (create a new table, double-write, write only to the new one, delete the old table) and that can now be done with running code.

For a new development workflow that tries to orient itself towards live code upgrades, Smoothver is likely to carry a lot more useful information than Semver would (and maybe could be nice for database migrations as well, since they share concerns).

Publishing the Artifacts

I needed to introduce the versioning mechanism because the overall publication workflow will obey it. If you're generating a new release version that requires a RESTART bump, then don't bother generating live code upgrade instructions. If you're generating anything else, do include them.

I've decided to center my workflow around git tags. If you tag your release v1.2.3, then v1.2.4 or v1.4.1 all do a live code upgrade, but v2.0.0 won't, regardless of which branch they go to. The CI script is not too complicated, and is in three parts:

  1. Fetch the currently deployed manifest, and see if the newly tagged version requires a live code upgrade ("relup") or not
  2. Build the release tarball with the relup instructions if needed. Here I rely purely on Luis's plugin to handle all the instructions.
  3. Put the files on S3

That's really all there is to it. I'm assuming that if you wanted to have more environments, you could set up gitops by having more tags (staging-v1.2.3, prod-v1.2.5) and more S3 buckets or paths. But everything is assumed to be driven by these build artifacts.

A small caveat here is that it's technically possible to generate upgrade instructions (appup files) that map from many to many versions: how to update to 1.2.0 from 1.0.0, 1.0.1, 1.0.2, and so on. Since I'm assuming a linear deployment flow here, I'm just ignoring that and always generating pairs from "whatever is in prod" to "whatever has been tagged". There are obvious race conditions in doing this, where two releases generated in parallel can specify upgrade rules from a shared release, but could be applied and rolled out in a distinct order.

Using the Manifest and Smoothver

Relying on the manifest and versions requires a few extra lines in the sidecar's update loop. They look at the version, and if it's a RESTART bump or an older release, they ignore it:

# Get the running version
CURRENT=$(${RELDIR}/bin/${RELEASE} versions | awk '$3=="permanent" && !vsn { vsn=$2 } $3=="current" { vsn=$2 } END { print vsn }')
TAG=$(curl "${S3_URL}/${RELEASE}-latest" -s)
if [[ "${CURRENT}" != "${TAG}" ]]; then
    IS_UPGRADE=$(echo "$TAG $CURRENT" | awk -v FS='[. ]' '($1==$4 && $2>$5) || ($1==$4 && $2>=$5 && $3>$6) {print 1; exit} {print 0}')
    if [[ $IS_UPGRADE -eq 1 ]]; then
        wget -nv "${S3_URL}/${RELEASE}-${TAG}.tar.gz" -O "${RELDIR}/releases/${RELEASE}-${TAG}.tar.gz"
        ${RELDIR}/bin/${RELEASE} unpack ${TAG}
        ${RELDIR}/bin/${RELEASE} install ${TAG}
        ${RELDIR}/bin/${RELEASE} upgrade ${TAG}
    fi
fi

There's some ugly awk logic, but I wanted to not host images. The script could be made a lot more solid by looking at whether we're bumping from the proper version to the next one, and in this it shares a sort of similar race condition to the generation step.

On the other hand, the install step looks at the specified upgrade instructions and will refuse to apply itself (resulting in a sidecar crash) if a bad release is applied.

I figure that alerting on crashed sidecars could be used to drive further automation to ask to delete and replace the pods, resulting in a rolling upgrade. Alternatively, the error itself could be used to trigger a failure in liveness and/or readiness probes, and force-automate that replacement. This is left as an exercise to the reader, I guess. The beauty of writing prototypes is that you can just decide this to be out of scope and move on, and let someone who's paid to operationalize that stuff to figure out the rest.

Oh and if you just change the Erlang VM's version? That changes the kubernetes YAML file, and if you're using anything like helm or some CD system (like ArgoCD), these will take care of running the rolling upgrade for you. Similarly, annotating the chart with a label of some sort indicating the RESTART version will accomplish the same purpose.

You may rightfully ask whether it is a good idea to bring mutability of this sort to a containerized world. I think that using S3 artifacts isn't inherently less safe than a container registry, dynamic feature flags, or relying on encryption services or DNS records for functional application state. I'll leave it at that.

Adding CI validation

Versioning things is really annoying. Each OTP app and library, and each release needs to be versioned properly. And sometimes you change dependencies and these dependencies won't have relup instructions available but you didn't know and that would break your live code upgrade.

What we can do is add a touch of automation to catch the most obvious failure situations and warn developers early about these issues. I've done so by adding a quick relup CI step to all pull requests, by using a version check script that encodes most of that logic.

The other thing I started experimenting with was setting up some sort of test suite for live code upgrades:

# This here step is a working sample, but if you were to run a more
# complex app with external dependencies, you'd also have to do a
# more intricate multi-service setup here, e.g.:
# https://github.com/actions/example-services
- name: Run relup application
  working-directory: erlang
  run: |
    mkdir relupci
    tar -xvf "${{ env.OLD_TAR }}" -C relupci
    # use a simple "run the task in the background" setup
    relupci/bin/dandelion daemon
    TAG=$(echo "${{ env.NEW_TAR }}" | sed -nr 's/^.*([0-9]+\.[0-9]+\.[0-9]+)\.tar\.gz$/\1/p')
    cp "${{ env.NEW_TAR }}" relupci/releases/
    relupci/bin/dandelion unpack ${TAG}
    relupci/bin/dandelion install ${TAG}
    relupci/bin/dandelion upgrade ${TAG}
    relupci/bin/dandelion versions

The one thing that would make this one a lot cooler is to write a small extra app or release that runs in the background while the upgrade procedure goes on. It could do things like:

  • generate constant load
  • run smoke tests on critical workflows
  • maintain live connections to show lack of failures
  • report on its state on demand

By starting that process before the live upgrade and questioning it after, we could ensure that the whole process went smoothly. Additional steps could also look at logs to know if things were fine.

The advantage of adding CI here is that each pull request can take measures to ensure it is safely upgradable live before being merged to main, even if none of them are deployed right away. By setting that gate in place, engineers are getting a much shorter feedback loop asking them to think about live deployments.

Running through a live code upgrade

I've run through a few iterations to test and check everything. I've set up microk8s on my laptop, ran kubectl apply -f dandelion.yaml, and showed that the pod was up and running fine:

$ kubectl -n dandelion get pods
NAME                                    READY   STATUS    RESTARTS   AGE
dandelion-deployment-648db88f44-49jl8   2/2     Running   0          25H

It is possible to exec into one of the containers, hop onto a REPL, and see what is going on:

$ kubectl -n dandelion exec -i -t dandelion-deployment-648db88f44-49jl8 -c dandelion-sidecar -- /bin/bash
root@dandelion-deployment-648db88f44-49jl8:/# /release/bin/dandelion remote_console
Erlang/OTP 25 [erts-13.0.2] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [jit]

Eshell V13.0.2  (abort with ^G)
(dandelion@localhost)1> release_handler:which_releases().
[{"dandelion","0.1.5",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.5","sasl-4.2"],
  permanent},
 {"dandelion","0.1.4",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.4","sasl-4.2"],
  old}]

This shows that the container had been running for a day, and already had two releases—it first booted on version 0.1.4 and had already gone through a bump to 0.1.5. I ran a small pull request changing the display (and messed up versioning, which CI caught!), merged it, tagged it v0.1.6, and started listening to my Kubernetes cluster:

$ nc 192.168.64.2 8080

   @
 \ |
__\!/__

...

   @
 \ |
__\!/__

vsn?
vsn: 0.1.5

   @
 \ |
__\!/__

      *
   @
 \ |
__\!/__

vsn?
vsn: 0.1.6
      *
   @
 \ |
__\!/__

...

This shows me interrogating the app (vsn?) and getting the version back, and without dropping the connection, having a little pappus floating in the air.

My REPL session was still live in another terminal:

(dandelion@localhost)2> release_handler:which_releases().
[{"dandelion","0.1.6",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.6","sasl-4.2"],
  permanent},
 {"dandelion","0.1.5",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.5","sasl-4.2"],
  old},
 {"dandelion","0.1.4",
  ["kernel-8.4.1","stdlib-4.0.1","dandelion-0.1.4","sasl-4.2"],
  old}]

showing that the old releases are still around as well. And here we have it, an actual zero-downtime deploy in a Kubernetes container.
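For reference, the release script's unpack, install, and upgrade subcommands roughly map onto OTP's release_handler API. Here is a hedged sketch of the equivalent calls from the remote console; the tarball name is an assumption based on the tags used in this post, and the actual wrapper script may do more than this:

%% Assuming releases/dandelion-0.1.6.tar.gz has been copied into place,
%% this is approximately what the wrapper script drives for an upgrade.
{ok, Vsn} = release_handler:unpack_release("dandelion-0.1.6"),
{ok, _FromVsn, _Descr} = release_handler:install_release(Vsn),
ok = release_handler:make_permanent(Vsn).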

Conclusion

Joe's favorite program could fit on a business card. Mine is maddening. But I think this is because Joe didn't care for the toolchains people were building and just wanted to do his thing. My version reflects the infrastructure we have put in place, and the processes we want and need for a team.

Rather than judging the scaffolding, I'd invite you to think about what would change when you start centering your workflow around a living system.

Those of you who have worked with bigger applications that have a central database or shared schemas around network protocols (or protobuf files or whatever) know that you approach your work differently when you have to consider how it's going to be rolled out. It impacts your tooling, how you review changes, how you write them, and ultimately just changes how you reason about your code and changes.

In many ways it's a more cumbersome way to deploy and develop code, but you can also think of the other things that change: what if, instead of having configuration management systems, you could hard-code your config in constants that just get rolled out live in less than a minute, and have all your configs tested as well as your code? Since all the release upgrades implicitly contain a release downgrade instruction set, just how much faster could you roll back (or automate rolling back) a bad deployment? Would you be less afraid of changing network-level schema definitions if you made a habit of changing them within your app? How would your workflow change if deploying took half a second and caused absolutely no churn or disruption to your cluster resources most of the time?

Whatever structure we have in place guides a lot of invisible emergent behaviour, both in code and in how we adjust ourselves to the structure. Much of what we do is a tacit response to our environment. There's a lot of power in experimenting with alternative structures and seeing what pops up at the other end. A weed is only considered as such in some contexts. This is a freak show of a deployment mechanism, but it sort of works, and maybe it's time to appreciate the dandelions for what they can offer.

Permalink

Erlang/OTP 25.0 Release

Erlang/OTP 25 is a new major release with new features, improvements as well as a few incompatibilities.

For details about new features, bugfixes and potential incompatibilities see the Erlang 25.0 README or the Erlang/OTP 25.0 downloads page.

Many thanks to all contributors!

Erlang/OTP 25.0 Highlights

stdlib

  • New function filelib:ensure_path/1 will ensure that all directories for the given path exist
  • New functions groups_from_list/2 and groups_from_list/3 in the maps module (a short shell sketch of a few of these additions follows this list)
  • New functions uniq/1 and uniq/2 in the lists module
  • New PRNG added to the rand module, for fast pseudo-random numbers.
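An illustrative shell session for three of the additions above (output formatting approximated):

1> filelib:ensure_path("tmp/a/b/c").   % creates the missing directories
ok
2> maps:groups_from_list(fun(X) -> X rem 2 end, [1,2,3,4,5]).
#{0 => [2,4],1 => [1,3,5]}
3> lists:uniq([a,b,a,c,b]).
[a,b,c]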

compiler, kernel, stdlib, syntax_tools

  • Added support for selectable features as described in EEP-60. Features can be enabled/disabled during compilation with options (ordinary and +term) to erlc as well as with directives in the file. Similar options can be used to erl for enabling/disabling features allowed at runtime. The new maybe expression EEP-49 is fully supported as the feature maybe_expr.
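As a small, hedged illustration of the maybe expression, using the final 25.0 -feature(...) directive format described further down in the compiler section (the runtime loading the module also needs the feature enabled, as noted above); the module and function names are made up for the example:

-module(maybe_demo).
-feature(maybe_expr, enable).
-export([sum_pair/1]).

%% Succeeds only when both keys hold values; on the first ?= match failure,
%% the whole expression returns the non-matching value (here: error).
sum_pair(Map) ->
    maybe
        {ok, A} ?= maps:find(a, Map),
        {ok, B} ?= maps:find(b, Map),
        {ok, A + B}
    end.

So sum_pair(#{a => 1, b => 2}) returns {ok, 3}, while sum_pair(#{a => 1}) returns error, without any nested case expressions.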

erts & JIT

  • The JIT now works for 64-bit ARM processors.
  • The JIT now does type-based optimizations based on type information in the BEAM files.
  • Improved the JIT’s support for external tools like perf and gdb, allowing them to show line numbers and even the original Erlang source code when that can be found.

erts, stdlib, kernel

  • Users can now configure ETS tables with the {write_concurrency, auto} option. This option forces tables to automatically change the number of locks that are used at run-time depending on how much concurrency is detected. The {decentralized_counters, true} option is enabled by default when {write_concurrency, auto} is active.
    Benchmark results comparing this option with the other ETS optimization options are available here: benchmarks. (An expression sketch of this option and a couple of other items in this section follows the list.)
  • To enable more optimizations, BEAM files compiled with OTP 21 and earlier cannot be loaded in OTP 25.
  • The signal queue of a process with the process flag message_queue_data=off_heap has been optimized to allow parallel reception of signals from multiple processes. This can improve performance when many processes are sending in parallel to one process. See benchmark.
  • The Erlang installation directory is now relocatable on the file system, given that the paths in the installation's RELEASES file are relative to the installation's root directory.
  • A new option called short has been added to the functions erlang:float_to_list/2 and erlang:float_to_binary/2. This option creates the shortest correctly rounded string representation of the given float that can be converted back to the same float again.
  • Introduction of quote/1 and unquote/1 functions in the uri_string module - a replacement for the deprecated functions http_uri:encode and http_uri:decode.
  • The new module peer supersedes the slave module. The slave module is now deprecated and will be removed in OTP 27.
  • global will now by default prevent overlapping partitions due to network issues. This is done by actively disconnecting from nodes that report that they have lost connections to other nodes. This will cause fully connected partitions to form instead of leaving the network in a state with overlapping partitions.
    It is possible to turn off the new behavior by setting the kernel configuration parameter prevent_overlapping_partitions to false. Doing this will retain the same behavior as in OTP 24 and earlier.
  • The format_status/2 callback for gen_server, gen_statem and gen_event has been deprecated in favor of the new format_status/1 callback.
    The new callback adds the possibility to limit and change many more things than just the state.
  • The timer module has been modernized and made more efficient, which makes the timer server less susceptible to being overloaded. The timer:sleep/1 function now accepts an arbitrarily large integer.
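As promised above, a short, illustrative expression sketch covering the ETS option, the short float formatting, and the new uri_string functions:

%% ETS table whose lock count adapts to observed concurrency
_Tab = ets:new(counters, [set, public, {write_concurrency, auto}]),
%% shortest representation that still round-trips to the same float
"0.1" = erlang:float_to_list(0.1, [short]),
%% replacements for the deprecated http_uri:encode/decode
"key%3Da%20value" = uri_string:quote("key=a value"),
"key=a value" = uri_string:unquote("key%3Da%20value").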

compiler

  • The maybe ... end construction as proposed in EEP-49 has been implemented. It can simplify complex code where otherwise deeply nested cases would have to be used.
    To enable maybe, give the option {enable_feature,maybe_expr} to the compiler. The exact option to use will change in a coming release candidate and then it will also be possible to use from inside the module being compiled.
  • When a record matching or record update fails, a {badrecord, ExpectedRecordTag} exception used to be raised. In this release, the exception has been changed to {badrecord, ActualValue}, where ActualValue is the value that was found instead of the expected record.
  • Add compile attribute -nifs() to empower compiler and loader with information about which functions may be overridden as NIFs by erlang:load_nif/2.
  • Improved and more detailed error messages when binary construction with the binary syntax fails. This applies both for error messages in the shell and for erl_error:format_exception/3,4.
  • Change format of feature options and directives for better consistency. Options to erlc and the -compile(..) directive now have the format {feature, feature-name, enable | disable}. The -feature(..) directive now has the format -feature(feature-name, enable | disable).

crypto

  • Add crypto:hash_equals/2, which is a constant-time comparison of hash values.
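A minimal sketch of where that helps, comparing HMAC tags without leaking timing information (the key and message are made up for the example):

Key = <<"a shared secret">>,
Msg = <<"payload">>,
Tag = crypto:mac(hmac, sha256, Key, Msg),
%% both arguments must be binaries of equal byte size; comparison is constant-time
true = crypto:hash_equals(Tag, crypto:mac(hmac, sha256, Key, Msg)).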

ssl

  • Introducing a new (still experimental) option {certs_keys, [cert_key_conf()]}. With this, a list of certificates with their associated keys may be used to authenticate the client or the server. The certificate-key pair that is considered best and matches negotiated parameters for the connection will be selected.

public_key

  • Functions for retrieving OS-provided CA certificates have been added.

dialyzer

  • Optimize operations in the erl_types module. Parallelize the Dialyzer pass remote.
  • Added the missing_return and extra_return options to raise warnings when specifications differ from inferred types. These are similar to, but not quite as verbose as, overspecs and underspecs. (See the sketch after this list.)
  • Dialyzer now better understands the types for min/2, max/2, and erlang:raise/3. Because of that, Dialyzer can potentially generate new warnings. In particular, functions that use erlang:raise/3 could now need a spec with a no_return() return type to avoid an unwanted warning.
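A hedged sketch of enabling the new warnings through Dialyzer's Erlang API (the surrounding options such as the "ebin" path are illustrative, and a pre-built PLT is assumed to exist):

%% Run Dialyzer on compiled beams with the new warning options enabled.
Warnings = dialyzer:run([{files_rec, ["ebin"]},
                         {warnings, [missing_return, extra_return]}]),
[io:format("~s", [dialyzer:format_warning(W)]) || W <- Warnings].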

Misc

  • A new DEVELOPMENT HOWTO guide has been added that describes how to build and test Erlang/OTP when fixing bugs or developing new functionality.
  • Testing has been added to the Github actions run for each opened PR so that more bugs are caught earlier when bug fixes and new features are proposed.

Download links for this and previous versions are found here

Permalink

Errors are constructed, not discovered

2022/04/13


Over a year ago, I left the role of software engineering* behind to become a site reliability engineer* at Honeycomb.io. Since then, I've been writing a bunch of blog posts over there rather than over here, including the following:

There are also a couple of incident reviews, including one on a Kafka migration and another on a spate of scaling-related incidents.

Either way, I only have so many things to rant about to fill two blogs, so this place here has been a bit calmer. However, I recently gave a talk at IRConf (video).

I am reproducing this talk here because, well, I stand behind the content, but it would also not fit the work blog's format. I'm also taking this opportunity because I don't know how many talks I'll give in the next few years. I've decided to limit how much traveling I do for conferences due to environmental concerns (if you see me at a conference, it either overlapped with other opportunistic trips for work or vacations, or it was close enough for me to attend via less polluting means), and so I'd like to still post some of the interesting talks I have when I can.

The Talk

This talk is first of all about the idea that errors are not objective truths. Even if we look at objective facts with a lot of care, errors are arbitrary interpretations that we come up with, constructions that depend on the perspective we have. Think of them the same way constellations in the night sky are made up of real stars, but their shape and meaning are made up based on our point of view and what their shapes remind us of.

The other thing this talk covers is what we can do once we accept this idea: the sorts of changes and benefits we can get from our post-incident process when we adjust to it.

I tend to enjoy incidents a lot. Most of the things in this talk aren't original ideas; they're things I've read and learned from smarter, more experienced people, and that I've put back together after digesting them for a long time. In fact, I thought my title for this talk was clever, but as I found out by accident a few days ago, it's an almost pure paraphrasing of a quote in a book I read over 3 years ago. So I can't properly give attribution for all these ideas because I don't know where they're from anymore, and I'm sorry about that.

A quote: 'Error' serves a number of functions for an organisation: as a defense against entanglement, the illusion of control, as a means for distancing, and as a marker for a failed investigation.

This is a quote from "'Those found responsible have been sacked': some observations on the usefulness of error", which I'm using because even if errors are arbitrary constructions, they carry meaning, and they are useful to organizations. The paper defines four types I'll be paraphrasing:

  • Defense against entanglement: the concept of error or fault is a way for an organization to shield itself from the liabilities of an incident. By putting the fault on a given operator, we avoid having to question the organization's own mechanisms, and safely deflect it away.
  • Illusion of control: by focusing on individuals and creating procedures, we can preserve the idea that we can manage the world rather than having to admit that adverse events will happen again. This gives us a sort of comfort.
  • Distancing: this is generally about being able to maintain the idea that "this couldn't happen here", either because we are doing things differently or because we are different people with different practices. This also gives us a decent amount of comfort.
  • Failed investigation: finally, safety experts seem to see the concept of error, particularly human error, as a marker that the incident investigation has ended too early. There were more interesting things to dig into and that hasn't been done—because the human error itself is worth understanding as an event.

So generally, error is useful as a concept, but as an investigator it is particularly useful as a signal to tell you when things get interesting, not as an explanation on their own.

An iceberg above and below the waterline with labels pointing randomly. Above the waterline are operations (scaling, alerting, deploying), and below the waterline are code reviews, testing, values, experience, roadmap, training, behaviours rewarded and punished, etc.

And so this sort of makes me think about how a lot of incident reviews tend to go. We use the incident as an opportunity because the disruption is large enough to let us think about it all. But the natural framing that easily comes through is to lay blame on the operational area.

Here I don't mean blame as in "people fucked up" nearly as much as "where do we think the organisation needs to improve": where do we think that, as a group, we need to improve as a result of this. The incident and the operations are the surface; they often need improvement for sure, because it is really tricky work done in special circumstances and it's worth constantly adjusting it, but stopping there is missing out on a lot of possible content that could be useful.

People doing the operations are more or less thrown into a context where a lot of big decisions have been made already. Whatever was tested, who was hired, what the budgets are: all these sorts of pressures are in large part defined by the rest of the organization, and they matter as well. They set the context in which operations take place.

So one question, then, is how we go from that surface-level vision to figuring out what happens below that organisational waterline.

timelines are necessary: a black line with dots on it and the last end is an explosion

The first step of almost any incident investigation is to start with a timeline. Something that lets us go back from the incident or its resolution, and that we use as breadcrumbs to lead us towards ways to prevent this sort of thing from happening again. So we start at the bottom where things go boom, and walk backwards from there.

the same timeline, but with labels for failure, proximate cause, root cause, and steady state

The usual frameworks we're familiar with apply labels to these common patterns. We'll call something a failure or an error. The thing that happened right before it tends to be called a proximate cause, a term frequently used in insurance situations: it's the last event in the whole chain which, had it not happened, would have prevented the failure. Then we walk back. Either five times, because we're doing the five whys, or until we land at a convenient place. If there is a mechanical or software component you don't like, you're likely to highlight its flaws there. If it's people or teams you don't trust as much, you may find human error there.

Even the concept of steady state is a shaky one. Large systems are always in some weird degraded state. In short, you find what you're looking for. The labels we use, the lens through which we look at the incident, influence the way we build our explanations.

the timeline is overlaid with light-grey branches that show paths sometimes leading to failures and sometimes not. Those are paths not taken, or that were visible to the practitioner

The overall system is not any of the specific lenses, though; it's a whole set of interactions. To get a fuller, richer picture, we have to account for what things looked like at the time, not just our hindsight-fuelled vision when looking back. There are a lot of things happening concurrently, a lot of decisions made to avoid bad situations that never took place, and some that did.

Hindsight bias is somewhat similar to outcome bias, which essentially says that because we know there was a failure, every decision we look at that took place before the incident will seem to us like it should obviously have appeared risky and wrong. That's because we know the result, and it affects our judgment. But when people were going down that path and deciding what to do, they were trying to do a good job; they were making the best calls they could with the abilities and the information available at the time.

We can't really avoid hindsight bias, but we can be aware of it. One tip there is to look at what was available at the time, and consider the signals that were available to people. If they made a decision that looks weird, then look for what made it look better than the alternatives back then.

Counterfactuals are another thing to avoid, and they're one of the trickiest ones to eliminate from incident reviews. They are essentially suppositions about things that have never happened and will never happen. Whenever we say "oh, if we had done this at this point in time instead of that, then the incident would have been prevented", we're talking about a fictional universe that never happened, and that's not productive.

I find it useful to always cast these comments into the future: "next time this happens, we should try that to prevent an issue." This orients the discussion towards more realistic means: how can we make this option more likely? The bad one less interesting? In many cases, a suggestion will even become useless: by changing something else in the system, a given scenario may no longer be a concern for the future, or it may highlight how a possible fix would in fact create more confusion.

Finally, normative judgments. Those are often close to counterfactuals, but you can spot them because they tend to be about what people should or shouldn't have done, often around procedures or questions of competence. "The engineer should have checked the output more carefully, and they shouldn't have run the command without checking with someone else, as stated in the procedure." Well they did because it arguably looked reasonable at the time!

The risk with a normative judgment is that it assumes that the established procedure is correct and realistically applicable to the situation at hand. It assumes that deviations and adjustments made by the responder are bound to fail, while conveniently ignoring all the times they work. We can't properly correct procedures if we think they're already perfect and it's wrong not to obey them, and we can't improve tooling if we believe the problem is always the person holding it wrong.

Another timeline representation of an incident, flatly linear. It has elements like alert, logging on, taking a look. Then a big gap. Then the issue is found, and the fix is written, and the issue closed

A key factor is to understand that in high-pressure incident responses, failures and successes use the same mechanisms. We're often tired, distracted, or possibly thrown in there without adequate preparation. What we do to try to make things go right, and often succeed through, is also in play when things go wrong.

People look for signals, and have a background and training that influence the tough calls, and these will usually be shared across situations. We tend to want things to go right. The outcome tends to define whether the decision was a good one or not, but the decision-making mechanism is shared both by decisions that go well and those that do not. And so we need to look at how these decisions are made with the best of intentions to have any chance of improving how events unfold the next time.

This leads to the idea that you want to look at what's not visible, because that's where the real work shows.

Same timeline, but the big gap is highlighted and says 'this is where we repair our understanding of the world'

I say this is "real work" because we come in to a task with an understanding of things, a sort of mental model. That mental model is the rolled up experience we have, and lets us frame all the events we encounter, and is the thing we use to predict the consequences of our decisions.

When we are in an incident, there's almost always a surprise in there, which means that the world and our mental model are clashing. This mismatch between our understanding of the world and the real world was already there. That gap needs to be closed, and the big holes in an incident's timeline tend to be one of the places where this takes place.

Whenever someone reports that "nothing relevant happens here", these are generally the places where active hypothesis generation happens, and where a lot of the repair and gap-bridging is taking place.

This is where the incident can become a very interesting window into the whole organizational iceberg below the water line.

An iceberg above and below the waterline with labels pointing randomly. Above the waterline are operations (scaling, alerting, deploying), and below the waterline are code reviews, testing, values, experience, roadmap, training, behaviours rewarded and punished, etc.

So looking back at the iceberg, looking at how decisions are made in the moment lets you glimpse the values below the waterline that are in play. What are people looking at? How are they making their decisions? What's their perspective? These are impacted by everything else that happened before.

If you see concurrent outages or multiple systems impacted, digging into which one gets resolved first, and why, is likely to give you insights about what responders feel confident about, and the things they believe are more important to the organization and users. They can reflect values and priorities.

If you look at who they ask for help and where they look for information (or avoid looking for it), this will let you know about various dynamics, social and otherwise, that might be going on in your organization. This can be because some people are central points of knowledge, others are jerks or are seen as more or less competent, or it can be about what people believe the state of documentation to be at that point in time.

And this is why changing how we look at and construct errors matters. If we take the straightforward causal approach, we'll tend to only skim the surface. Looking at how people do their jobs and how they make decisions is an effective way to go below that waterline, and have a much broader impact than staying above water.

A list of questions such as 'what was your first guess?', 'what made you look at this dashboard?', or 'how were you feeling at the time?'

To take a proper dive, it helps to ask the proper type of questions. As a facilitator, your job is to listen to what people tell you, but there are ways to prompt for more useful information. The Etsy debriefing facilitation guide is a great source, and so is Jeli's Howie guide. The slide contains some of the questions I like to ask most.

There's one story I recall from a previous job where a team had written an incident report on an outage with some service X; the report had that sort of 30-minute gap in it, and they were asking for feedback on it. I instantly asked "so what was going on during this time?", only for someone on the team to answer "oh, we were looking for the dashboard of service Y". I asked why they had been looking at the dashboard of another service, and he said that the service's own dashboard wasn't trustworthy, and that this one gave a better picture of the health of the service through its effects. And just like that, we opened new paths for improvements around something so normal it had become invisible.

Another one also came from a previous job where an engineer kept accidentally deleting production databases and triggering a whole disaster recovery response. They were initially trying to delete a staging database that was dynamically generated for test cases, but kept fat-fingering the removal of production instances in the AWS console. Other engineers were getting mad and felt that person was being incompetent, and were planning to remove all of their AWS console permissions because there also existed an admin tool that did the same thing safely by segmenting environments.

I ended up asking the engineer if there was anything that made them choose the AWS console over the admin tool given the difference in safety, and they said, quite simply, that the AWS console has an autocomplete and they never remembered the exact table name, so it was just much faster to delete the table there than in the admin tool. This was an interesting one, because instead of blaming the engineer for being incompetent, it opened the door to questioning the gap in tooling rather than adding more blockers and procedures.

In both of these stories, a focus on how people were making their decisions and their direct work experience managed to highlight alternative views that wouldn't have come up otherwise. They can generate new, different insights and action items.

A view of a sequence diagram used for an incident review

And this is the sort of map that, when I have time for it, I try to generate at Honeycomb. It's non-linear, and the main objective is to help show different patterns about the response. Rather than building a map by categorizing events within a structure, the idea is to lay the events around to see what sort of structure pops up. And then we can walk through the timeline and ask what we were thinking, feeling, or seeing.

The objective is to highlight challenging bits and look at the way we work in a new light. Are there things we trust, distrust? Procedures that don't work well, bits where we feel lost? Focusing on these can improve response in the future.

This idea of focusing on generating rather than categorizing is intended to take an approach that is closer to qualitative research than quantitative research.

A comparison of attributes of qualitative vs. quantitative research

The way we structure our reviews will have a large impact on how we construct errors. I tend to favour a qualitative approach to a quantitative one.

A quantitative approach will often look at ways to aggregate data, and create ways to compare one incident to the next. They'll measure things such as the Mean Time To Recovery (MTTR), the impact, the severity, and will look to assign costs and various classifications. This approach will be good to highlight trends and patterns across events, but as far as I can tell, they won't necessarily provide a solid path for practical improvements for any of the issues found.

The qualitative approach, by comparison, aims to do a deep dive to provide a more complete understanding. It tends to be more observational and generative. Instead of cutting up the incident and classifying its various parts, we look at what was challenging, what people felt was significant during the incident, and all sorts of messy details. These will highlight tricky dynamics, both for high-pace outages and wider organizational practices, and are generally behind the insights that help change things effectively.

A comparison of insights obtained with both approaches (as described in the text)

To put this difference in context, I have an example from a prior job, where one of my first mandates was to try to help with their reliability story. We went over 30 or so incident reports that had been written over the previous year, and a pattern that quickly came up was how many reports mentioned "lack of tests" (or lack of good tests) as causes, and had "adding tests" in action items.

By looking at the overall list, our initial diagnosis was that testing practices were challenging. We thought of improving the ergonomics around tests (making them faster) and of providing training in better ways to test. But then we had another incident where the review reported tests as an issue, so I decided to jump in.

I reached out to the engineers in question and asked what made them feel like they had enough tests. I said that we often write tests up until the point where we feel they're not adding much anymore, and that I was wondering what they were looking at, what made them feel like they had reached that point. They just told me directly that they knew they didn't have enough tests. In fact, they knew that the code was buggy. But they felt in general that it was safer to be on time with a broken project than late with a working one. They were afraid that being late would put them in trouble and have someone yell at them for not doing a good job.

And so that revealed a much larger pattern within the organization and its culture. When I went up to upper management, they absolutely believed that engineers were empowered and should feel safe pressing a big red button that stopped feature work if they thought their code wasn't ready. The engineers on that team felt that while this was what they were being told, in practice they'd still get in trouble.

There's no amount of test training that would fix this sort of issue. The engineers knew they didn't have enough tests and they were making that tradeoff willingly.

A smooth line with a zoomed-in area that shows it all bumpy and cracked.

So to conclude on this, the focus should be on understanding the mess:

  • go for a deeper understanding of specific incidents where you feel something intriguing or interesting happened. Aggregates of all incidents tend to hide messy details, so if you have a bunch of reviews to do, it's probably better to be thorough on one interesting one than being shallow on many of them
  • Mental models are how problem solving tends to be done; we understand and predict things based on them. Incidents are amazing opportunities to correct, compare, and contrast mental models to make them more accurate or easier to contextualize
  • seek an understanding of how people do their work. There is always a gap between the work as we imagine it to be and what it actually is. The narrower that gap, the more effective our changes are going to be, so focusing on understanding all the nitty gritty details of work and their pressures is going to prove more useful than having super solid numbers
  • psychological safety is always essential; the thing that lets us narrow the gap between work as done and work as imagined is going to be whether people feel safe in reporting and describing what they go through. Without psychological safety and good blame awareness, you're going to have a hard time getting good results.

Overall, the idea is that looking for understanding more than causes opens up a lot of doors and makes incidents more valuable.


* I can't legally call myself a software engineer, and technically neither can I be a site reliability engineer, because Quebec considers engineering to be a protected discipline. I however, do not really get to tell American employers what they should give as a job title to people, so I get stuck having titles I can't legally advertise but for which no real non-protected forms exist to communicate. So anywhere you see me referred to any sort of "engineer", that's not an official thing I would choose as a title. It'd be nice to know what the non-engineer-titled equivalent of SRE ought to be.

Permalink

Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.