Erlang/OTP 26.0 Release

Erlang/OTP 26 is a new major release with new features and improvements, as well as a few incompatibilities.

For details about new features, bug fixes, and potential incompatibilities, see the Erlang 26.0 README or the Erlang/OTP 26.0 downloads page.

Below are some of the highlights of the release:

There is also a blog post about the highlights.

Parsetools

  • Leex has been extended with optional column number support.

Stdlib

  • The family of enumeration functions in the lists module has been extended with enumerate/3, which allows a step value to be supplied (see the example after this list).
  • Unicode has been updated to version 15.0.0.
  • proc_lib:start*/* has become synchronous when the started process fails. This requires that a failing process calls the new function proc_lib:init_fail/2,3, or exits, to indicate failure. All OTP behaviours have been fixed to do this.
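
For example, in the Erlang shell (the output follows the documented {Index, Element} tuple format):

1> lists:enumerate(0, 10, [a, b, c]).
[{0,a},{10,b},{20,c}]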

The Shell

There are a lot of new features and improvements in the Erlang shell:

  • Auto-completion of variables, record names, record field names, map keys, function parameter types, and filenames.
  • The current expression can be opened in an external editor from the shell.
  • Records (with types), functions, specs, and types can be defined in the shell.

New terminal

  • The TTY/terminal subsystem has been rewritten. Windows users will notice that erl.exe has the same functionality as a normal Unix shell and that werl.exe is just a symlink to erl.exe. This makes the Windows Erlang terminal experience identical to that of Unix.

Compiler and JIT optimizations:

  • Creation and matching of binaries with segments of fixed sizes have been optimized.

  • Creation and matching of UTF-8 segments have been optimized.

  • Appending to binaries has been optimized.

  • The compiler and JIT now generate better code for creation of small maps where all keys are literals known at compile time.

  • Thanks to the optimizations above, the performance of the base64 module has been significantly improved. For example, on an x86_64 system with the JIT, both encoding and decoding are almost three times faster than in Erlang/OTP 25.

Maps

  • Map comprehensions, as suggested in EEP 58, have now been implemented (see the sketch after this list).

  • Some map operations have been optimized by changing the internal sort order of atom keys. This changes the (undocumented) order of how atom keys in small maps are printed and returned by maps:to_list/1 and maps:next/1. The new order is unpredictable and may change between different invocations of the Erlang VM.

  • The new function maps:iterator/2 creates an iterator that returns the map elements in a deterministic order. There are also new modifiers k and K for the format string in io:format() to support printing map elements in key order.
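
A small sketch of both items, given some map M bound in the shell (the ~kp usage follows the io documentation):

%% EEP 58 map comprehension: double every value in M.
Doubled = #{K => V * 2 || K := V <- M},

%% Deterministic iteration order with maps:iterator/2.
{FirstKey, FirstValue, _Next} = maps:next(maps:iterator(M, ordered)),

%% New 'k' modifier: print the map with its keys in order.
io:format("~kp~n", [M]).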

Dialyzer

  • Added the new built-in type dynamic(), introduced in EEP 61, improving support for gradual type checkers (see the sketch after this list).

  • Dialyzer has a new incremental mode that can be invoked by giving the --incremental option when running Dialyzer. This new incremental mode is likely to become the default in a future release.
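
As a rough illustration of where dynamic() fits, a spec can mark a return value whose type is deliberately left to the caller and to gradual checkers (the module below is made up):

-module(app_config).
-export([get/1]).

%% dynamic() tells gradual type checkers that callers decide the value's type.
-spec get(atom()) -> dynamic().
get(Key) ->
    persistent_term:get({app_config, Key}).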

Misc ERTS, Stdlib, Kernel, Compiler

  • Multi time warp mode is now enabled by default. This assumes that all code executing on the system is time warp safe.

  • Support for UTF-8 atoms and strings in the NIF interface, including the new functions enif_make_new_atom, enif_make_new_atom_len, and enif_get_string_length.

  • The BIFs min/2 and max/2 are now allowed to be used in guards and match specs.

  • Improved the selective receive optimization, which can now be enabled for references returned from other functions. This greatly improves the performance of gen_server:send_request/3, gen_server:wait_response/2, and similar functions.

  • New trace feature call_memory. Similar to call_time tracing, but instead of measuring accumulated time in traced functions, it measures accumulated heap space consumed by traced functions. It can be used to compare how much different functions contribute to triggering garbage collection.

  • It is no longer necessary to enable a feature in the runtime system in order to load modules that are using it. It is sufficient to enable the feature in the compiler when compiling the module.

  • inet:setopts/2 has three new options: reuseport, reuseport_lb, and exclusiveaddruse.

  • -fno-omit-frame-pointer is now applied to all of the Erlang VM when using the JIT, so that tools such as perf can crawl the process stacks.

  • In the lists module, the zip family of functions now takes options to allow handling lists of different lengths (see the sketch after this list).

  • Added the zip:zip_get_crc32/2 function to retrieve the CRC32 checksum from an opened ZIP archive.
  • gen_server has been optimized by caching callback functions.

  • The Erlang DNS resolver inet_res and its helper modules have been updated for RFC 6891, to handle OPT RR with the DNSSEC OK (DO) bit.

  • Introduced application:get_supervisor/1.

  • OTP boot code paths are now cached, limiting how many folders are accessed during a module lookup. The cache can be disabled with -cache_boot_path false.
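
A small module sketch covering the min/max guard change and the new zip options mentioned above (the trim and {pad, Defaults} variants follow the lists documentation; module and function names are made up):

-module(otp26_misc).
-export([label/3, pairs/2]).

%% min/2 and max/2 can now be used in guards.
label(X, Lo, Hi) when min(X, Lo) =:= Lo, max(X, Hi) =:= Hi -> in_range;
label(_, _, _) -> out_of_range.

%% The zip family takes an option for lists of different lengths:
%% trim drops the longer tail, {pad, Defaults} pads the shorter list.
pairs(Keys, Values) ->
    lists:zip(Keys, Values, {pad, {undefined, undefined}}).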

SSL

  • The client default verify option has been changed to verify_peer. Note that this makes it mandatory to also supply trusted CA certificates or explicitly set verify to verify_none. This also applies when using the so-called anonymous test cipher suites defined in TLS versions before TLS-1.3 (a minimal client sketch follows this list).

  • Support for Kernel TLS (kTLS) has been added to the SSL application for TLS distribution (-proto_dist inet_tls), enabled via the SSL option {ktls, true}.
  • Improved error checking and handling of ssl options.
  • Mitigated memory usage from large certificate chains by lowering the maximum handshake size. This should not affect the common cases; if needed, it can be configured to a higher value.

  • For security reasons the SHA1 and DSA algorithms are no longer among the default values.

  • Added encoding and decoding of the use_srtp hello extension to make it easier for DTLS users to implement SRTP functionality.
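
Given the new default, a minimal client call could look roughly like this (the host name is a placeholder; public_key:cacerts_get/0 loads the OS trust store):

%% Connect with peer verification and the OS-provided CA certificates.
{ok, Socket} = ssl:connect("example.com", 443,
                           [{verify, verify_peer},
                            {cacerts, public_key:cacerts_get()},
                            {depth, 3}]).

%% Explicitly opting out instead (not recommended):
%% ssl:connect("example.com", 443, [{verify, verify_none}]).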

For more details about new features and potential incompatibilities, see the README.

Many thanks to all contributors!

Download links for this and previous versions are found here.


Erlang/OTP 26.0 Release Candidate 3

Erlang/OTP 26.0-rc3 is the third and last release candidate before the OTP 26.0 release. Release candidate 3 fixes some bugs found in the first two release candidates.

The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you. We encourage users to try it out and give us feedback either by creating an issue here https://github.com/erlang/otp/issues or by posting to Erlangforums.

All artifacts for the release can be downloaded from the Erlang/OTP Github release and you can view the new documentation at https://erlang.org/documentation/doc-14.0-rc3/doc/. You can also install the latest release using kerl like this: kerl build 26.0-rc3 26.0-rc3.

Erlang/OTP 26 is a new major release with new features and improvements, as well as a few incompatibilities. Some of the new features are highlighted below.

Many thanks to all contributors!

Below are some highlights of the release:

Highlights RC2

Parsetools

  • Leex has been extended with optional column number support.

Stdlib

  • The family of enumeration functions in the lists module has been extended with enumerate/3, which allows a step value to be supplied.
  • Unicode has been updated to version 15.0.0.
  • proc_lib:start*/* has become synchronous when the started process fails. This requires that a failing process calls the new function proc_lib:init_fail/2,3, or exits, to indicate failure. All OTP behaviours have been fixed to do this.

Dialyzer

SSL

  • The client default verify option has been changed to verify_peer. Note that this makes it mandatory to also supply trusted CA certificates or explicitly set verify to verify_none. This also applies when using the so-called anonymous test cipher suites defined in TLS versions before TLS-1.3.

Highlights RC1

The Shell

There are a lot of new features and improvements in the Erlang shell:

  • Auto-completion of variables, record names, record field names, map keys, function parameter types, and filenames.
  • The current expression can be opened in an external editor from the shell.
  • Records (with types), functions, specs, and types can be defined in the shell.

New terminal

  • The TTY/terminal subsystem has been rewritten. Windows users will notice that erl.exe has the same functionality as a normal Unix shell and that werl.exe is just a symlink to erl.exe. This makes the Windows Erlang terminal experience identical to that of Unix.

Compiler and JIT optimizations:

  • Creation and matching of binaries with segments of fixed sizes have been optimized.

  • Creation and matching of UTF-8 segments have been optimized.

  • Appending to binaries has been optimized.

  • The compiler and JIT now generate better code for creation of small maps where all keys are literals known at compile time.

  • Thanks to the optimizations above, the performance of the base64 module has been significantly improved. For example, on an x86_64 system with the JIT, both encoding and decoding are almost three times faster than in Erlang/OTP 25.

Maps

  • Map comprehensions, as suggested in EEP 58, have now been implemented.

  • Some map operations have been optimized by changing the internal sort order of atom keys. This changes the (undocumented) order of how atom keys in small maps are printed and returned by maps:to_list/1 and maps:next/1. The new order is unpredictable and may change between different invocations of the Erlang VM.

  • The new function maps:iterator/2 creates an iterator that returns the map elements in a deterministic order. There are also new modifiers k and K for the format string in io:format() to support printing map elements in key order.

Dialyzer

  • Dialyzer has a new incremental mode that can be invoked by giving the --incremental option when running Dialyzer. This new incremental mode is likely to become the default in a future release.

Misc ERTS, Stdlib, Kernel, Compiler

  • Multi time warp mode is now enabled by default. This assumes that all code executing on the system is time warp safe.

  • Support for UTF-8 atoms and strings in the NIF interface, including the new functions enif_make_new_atom, enif_make_new_atom_len, and enif_get_string_length.

  • The BIFs min/2 and max/2 are now allowed to be used in guards and match specs.

  • Improved the selective receive optimization, which can now be enabled for references returned from other functions. This greatly improves the performance of gen_server:send_request/3, gen_server:wait_response/2, and similar functions.

  • New trace feature call_memory. Similar to call_time tracing, but instead of measuring accumulated time in traced functions, it measures accumulated heap space consumed by traced functions. It can be used to compare how much different functions contribute to triggering garbage collection.

  • It is no longer necessary to enable a feature in the runtime system in order to load modules that are using it. It is sufficient to enable the feature in the compiler when compiling the module.

  • inet:setopts/2 has three new options: reuseport, reuseport_lb, and exclusiveaddruse.

  • -fno-omit-frame-pointer is now applied to all of the Erlang VM when using the JIT, so that tools such as perf can crawl the process stacks.

  • In the lists module, the zip family of functions now takes options to allow handling lists of different lengths.

  • Added the zip:zip_get_crc32/2 function to retrieve the CRC32 checksum from an opened ZIP archive.
  • gen_server has been optimized by caching callback functions.

  • The Erlang DNS resolver inet_res and its helper modules have been updated for RFC 6891, to handle OPT RR with the DNSSEC OK (DO) bit.

  • Introduced application:get_supervisor/1.

  • OTP boot code paths are now cached, limiting how many folders are accessed during a module lookup. The cache can be disabled with -cache_boot_path false.

SSL

  • Support for Kernel TLS (kTLS) has been added to the SSL application for TLS distribution (-proto_dist inet_tls), enabled via the SSL option {ktls, true}.
  • Improved error checking and handling of ssl options.
  • Mitigated memory usage from large certificate chains by lowering the maximum handshake size. This should not affect the common cases; if needed, it can be configured to a higher value.

  • For security reasons the SHA1 and DSA algorithms are no longer among the default values.

  • Added encoding and decoding of the use_srtp hello extension to make it easier for DTLS users to implement SRTP functionality.

For more details about new features and potential incompatibilities, see the README.


Erlang/OTP 26.0 Release Candidate 2


Erlang/OTP 26.0-rc2 is the second of three release candidates before the OTP 26.0 release. Release candidate 2 fixes some bugs found in the first release candidate and also adds a few features.

The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you. We encourage users to try it out and give us feedback either by creating an issue here https://github.com/erlang/otp/issues or by posting to Erlangforums.

All artifacts for the release can be downloaded from the Erlang/OTP Github release and you can view the new documentation at https://erlang.org/documentation/doc-14.0-rc2/doc/. You can also install the latest release using kerl like this: kerl build 26.0-rc2 26.0-rc2.

Erlang/OTP 26 is a new major release with new features and improvements, as well as a few incompatibilities. Some of the new features are highlighted below.

Many thanks to all contributors!

Below are some highlights of the release:

Highlights RC2

Parsetools

  • Leex has been extended with optional column number support.

Stdlib

  • The family of enumeration functions in the lists module has been extended with enumerate/3, which allows a step value to be supplied.
  • Unicode has been updated to version 15.0.0.
  • proc_lib:start*/* has become synchronous when the started process fails. This requires that a failing process calls the new function proc_lib:init_fail/2,3, or exits, to indicate failure. All OTP behaviours have been fixed to do this.

Dialyzer

SSL

  • The client default verify option has been changed to verify_peer. Note that this makes it mandatory to also supply trusted CA certificates or explicitly set verify to verify_none. This also applies when using the so-called anonymous test cipher suites defined in TLS versions before TLS-1.3.

Highlights RC1

The Shell

There are a lot of new features and improvements in the Erlang shell:

  • Auto-completion of variables, record names, record field names, map keys, function parameter types, and filenames.
  • The current expression can be opened in an external editor from the shell.
  • Records (with types), functions, specs, and types can be defined in the shell.

New terminal

  • The TTY/terminal subsystem has been rewritten. Windows users will notice that erl.exe has the same functionality as a normal Unix shell and that werl.exe is just a symlink to erl.exe. This makes the Windows Erlang terminal experience identical to that of Unix.

Compiler and JIT optimizations:

  • Creation and matching of binaries with segments of fixed sizes have been optimized.

  • Creation and matching of UTF-8 segments have been optimized.

  • Appending to binaries has been optimized.

  • The compiler and JIT now generate better code for creation of small maps where all keys are literals known at compile time.

  • Thanks to the optimizations above, the performance of the base64 module has been significantly improved. For example, on an x86_64 system with the JIT, both encoding and decoding are almost three times faster than in Erlang/OTP 25.

Maps

  • Map comprehensions, as suggested in EEP 58, have now been implemented.

  • Some map operations have been optimized by changing the internal sort order of atom keys. This changes the (undocumented) order of how atom keys in small maps are printed and returned by maps:to_list/1 and maps:next/1. The new order is unpredictable and may change between different invocations of the Erlang VM.

  • The new function maps:iterator/2 creates an iterator that returns the map elements in a deterministic order. There are also new modifiers k and K for the format string in io:format() to support printing map elements in key order.

Dialyzer

  • Dialyzer has a new incremental mode that can be invoked by giving the --incremental option when running Dialyzer. This new incremental mode is likely to become the default in a future release.

Misc ERTS, Stdlib, Kernel, Compiler

  • Multi time warp mode is now enabled by default. This assumes that all code executing on the system is time warp safe.

  • Support for UTF-8 atoms and strings in the NIF interface, including the new functions enif_make_new_atom, enif_make_new_atom_len, and enif_get_string_length.

  • The BIFs min/2 and max/2 are now allowed to be used in guards and match specs.

  • Improved the selective receive optimization, which can now be enabled for references returned from other functions. This greatly improves the performance of gen_server:send_request/3, gen_server:wait_response/2, and similar functions.

  • New trace feature call_memory. Similar to call_time tracing, but instead of measuring accumulated time in traced functions, it measures accumulated heap space consumed by traced functions. It can be used to compare how much different functions contribute to triggering garbage collection.

  • It is no longer necessary to enable a feature in the runtime system in order to load modules that are using it. It is sufficient to enable the feature in the compiler when compiling the module.

  • inet:setopts/2 has three new options: reuseport, reuseport_lb, and exclusiveaddruse.

  • -fno-omit-frame-pointer is now applied to all of the Erlang VM when using the JIT, so that tools such as perf can crawl the process stacks.

  • In the lists module, the zip family of functions now takes options to allow handling lists of different lengths.

  • Added the zip:zip_get_crc32/2 function to retrieve the CRC32 checksum from an opened ZIP archive.
  • gen_server has been optimized by caching callback functions.

  • The Erlang DNS resolver inet_res and its helper modules have been updated for RFC 6891, to handle OPT RR with the DNSSEC OK (DO) bit.

  • Introduced application:get_supervisor/1.

  • OTP boot code paths are now cached, limiting how many folders are accessed during a module lookup. The cache can be disabled with -cache_boot_path false.

SSL

  • Support for Kernel TLS (kTLS) has been added to the SSL application for TLS distribution (-proto_dist inet_tls), enabled via the SSL option {ktls, true}.
  • Improved error checking and handling of ssl options.
  • Mitigated memory usage from large certificate chains by lowering the maximum handshake size. This should not affect the common cases; if needed, it can be configured to a higher value.

  • For security reasons the SHA1 and DSA algorithms are no longer among the default values.

  • Added encoding and decoding of the use_srtp hello extension to make it easier for DTLS users to implement SRTP functionality.

For more details about new features and potential incompatibilities, see the README.


Embedded and cloud Elixir for grid-management at Sparkmeter

Welcome to our series of case studies about companies using Elixir in production. See all cases we have published so far.

SparkMeter is a company on a mission to increase access to electricity. They offer grid-management solutions that enable utilities in emerging markets to run financially sustainable, efficient, and reliable systems.

Elixir has played an important role in simplifying SparkMeter systems by providing a unified developer experience across their products. Elixir’s versatility in different domains, such as embedded software, data processing, and HTTP APIs, proved to be a valuable asset to a team who aims to release robust products quickly and confidently.

Two of their products are smart electrical meters and grid-management software. These can be used to measure electricity usage, gather health information about an electrical grid, and manage billing.

Here’s an overview of their architecture:

SparkMeter architecture generation one

The meters are embedded devices responsible for collecting measures such as electricity usage. They communicate with each other via a mesh network and also communicate with the grid edge management unit. The grid edge management unit is an embedded system that receives and processes data from up to thousands of meters. The grid edge management unit also communicates with servers running in the cloud. Those servers send data to and receive data from the grid edge management units and process it for use by internal systems and user-facing software.

The challenge

The infrastructure in which their embedded devices are deployed is not reliable. The cellular network used for communication between the ground and the cloud could fail, and the electricity supply to the embedded systems could go down. Therefore, their system needed to be fault-tolerant, and they needed to build equipment that didn’t require constant field maintenance.

In light of these requirements, they identified areas for improvement in the first generation of their product. One of the things they needed to do was develop a new grid edge management unit. Additionally, their product was mission-critical, so they wanted a technology they could confidently put into production, and one that would not take more than a year of development and QA before releasing a new generation of their product.

That’s when they discovered Elixir and Nerves.

The trade-offs of adopting Elixir and Nerves

Nerves is an open-source platform that combines the Erlang virtual machine and Elixir ecosystem to build and deploy embedded systems.

When considering the adoption of Elixir and Nerves, SparkMeter recognized many advantages the technologies offered.

Elixir helped them meet the requirement of building a distributed and fault-tolerant system. That’s because Elixir leverages the power of the Erlang VM and the OTP framework, which were designed with that requirement in mind.

Regarding Nerves, they saw it as an entire ecosystem for doing embedded development with many advantages. For example, it has a good story for doing local development and going from that to deploying on an embedded device. It makes it easy to connect to an embedded device for iterative development. And it also enables fine-grained control of system boot, so they can handle scenarios when certain parts of the system won’t start.

That said, they had two concerns: the continued growth of Nerves and finding talent with expertise in the Elixir/Nerves stack.

They wanted to ensure that Nerves would continue to grow. But they realized that even if it didn't, the benefits Nerves was already offering could give them a lot of leverage. Here's what their senior VP of engineering, Jon Thacker, had to say about that:

Without Nerves, we would be on our own to figure out a lot. How to do distribution, the development environment, and how to support different architectures. So it really is a batteries-included framework for doing production-grade embedded systems.

- Jon Thacker, Senior VP of Engineering

When we interviewed Jon for this case study, they had already been using Elixir and Nerves for more than two years. And with the benefit of hindsight, here’s what he said about adopting Nerves:

Making sure that Nerves continued to grow was a concern. But it has done so and is showing a very positive trajectory. It was a calculated risk and, as it turns out, it was the correct choice.

- Jon Thacker, Senior VP of Engineering

When it came to finding talent, they approached the problem in two ways. First, they started to build the pilot with a contractor to ensure that the staffing risk didn’t affect their timeline. But they also wanted to have an internal team to take ownership of the product in the long term. So, shortly after finishing the first version of the new system, they hired two engineers with experience in Elixir, Michael Waud and Benjamin Milde.

Besides hiring people with previous experience in Elixir, Jon noticed that training their embedded engineers in Elixir was also a viable option. Here’s what he told us about that:

I’m traditionally an embedded engineer, and I only learned Elixir as part of this project. However, transferring my mental model was so easy that I do believe that we would be capable of training other embedded engineers as well.

- Jon Thacker, Senior VP of Engineering

The new system

SparkMeter used Elixir for the ground (embedded) and cloud aspects of the new system they built. Here is an overview of the architecture:

SparkMeter architecture generation two

For the firmware of the grid edge management unit, they used Nerves. For the hardware, they built on top of a BeagleBone Black device.

The communication between the grid edge management unit and the meters was via radio, using Rust to manage the radio hardware module inside the grid edge management unit. They used Elixir Ports to communicate with Rust and process the data from the meters.
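
The general shape of that Port integration looks roughly like this; a minimal sketch with a made-up executable name, not SparkMeter's actual code:

# Start the external (e.g. Rust) program and exchange length-prefixed
# binary messages with it; "priv/radio_bridge" is a hypothetical path.
port = Port.open({:spawn_executable, "priv/radio_bridge"}, [:binary, {:packet, 2}])

Port.command(port, <<0x01, 0x2A>>)   # send a frame to the external program

receive do
  {^port, {:data, frame}} -> IO.inspect(frame, label: "frame from radio")
end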

Elixir was also used for communication with the cloud servers via 3G or EDGE. This communication required bandwidth usage optimization due to the cost of sending large volumes of data through the cellular network. They evaluated various solutions like REST, CoAP, MQTT, Kafka, and WebSockets. Still, none fit their specific needs, so they created a custom protocol tailored to their use case, which involved designing a binary protocol and implementing a TCP server. Mike Waud discussed this in more detail in his talks at ElixirConf 2021 and 2022.

The grid edge management unit also required a local web user interface that could be accessed on-site via Wi-Fi. For this, they used Phoenix and Liveview.

The cloud aspect of the system is responsible for receiving data from the grid edge management units and sending control commands. It also runs a TCP server with their custom protocol, implemented in Elixir. The data received from the grid edge management units is stored in PostgreSQL and then consumed by a Broadway-based data pipeline.

The cloud system also exposes an HTTP API implemented with Phoenix. This API is consumed by other internal systems to interact with their PostgreSQL database.

Reaping the benefits

During and after the development of the new generation of their system, SparkMeter observed many benefits.

One of them was the reduction of the complexity of the grid edge management unit. The old version had more moving parts, using Ubuntu and Docker for the system level, Python/Celery and RabbitMQ for asynchronous processing, and Systemd for managing starting job processes.

In the new version, they replaced all of that mainly with Elixir and Nerves. And for the parts where they needed tools that were not part of the BEAM stack, they could manage them like any other BEAM process by using Elixir Ports. Here’s what they said about that experience:

The new grid edge management unit has a very unified architecture. We can treat everything as an (Elixir) process. We have full control over the start and stop within a single ecosystem. It’s just a very coherent storyline.

- Jon Thacker, Senior VP Of Engineering

Another aspect they liked about Nerves was that it included security best practices. For example, they used SSL certificates on the client and the server side for communication between the ground and the cloud. Nerves made this easy through the NervesKey component, which enables the use of a hardware security module to protect the private key. Nerves also made it easy to keep up with system security patches, as the firmware generated by Nerves is a single bundle containing a minimal Linux platform and their application packaged as a release. Here’s what they said about security in Nerves:

It’s easy enough to keep tracking upstream changes, so we’re not getting behind the latest security patches. Nerves made that easy. Nerves just pushed us towards a good security model.

- Jon Thacker, Senior VP Of Engineering

The communication between the ground and the cloud involved implementing a custom TCP server running in both parts of the system. Network programming is not an everyday task for many application developers, but Elixir helped them a lot with that:

I had never written a TCP client or a server before, it’s just not something you even think about. But doing it in Elixir, particularly on the protocol level of sending binaries, was a pleasure to work with! Something that would be super tedious in an imperative language, with Elixir and pattern matching, is so clear!

- Michael Waud, Senior Software Engineer
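
To give a flavour of what that looks like, here is a hypothetical frame layout decoded with binary pattern matching; it is not SparkMeter's actual protocol:

defmodule Protocol do
  # A frame: 1-byte message type, 2-byte payload length, payload, then the rest.
  def decode(<<type::8, len::16, payload::binary-size(len), rest::binary>>) do
    {:ok, %{type: type, payload: payload}, rest}
  end

  def decode(_incomplete), do: :need_more_data
end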

Another benefit they received from using Elixir on the ground and in the cloud was code reuse. For example, the encoding and decoding of their custom protocol were reused for both the embedded and cloud parts.

It would’ve been a much larger challenge if we hadn’t been running Elixir in the cloud and on the grid edge management unit because we could write it once. The encoding and decoding we wrote once, we gained a lot from being able to share code.

- Michael Waud, Senior Software Engineer

Michael also pointed out that by controlling the complete connection from the grid edge management unit up to the cloud, they could reduce bandwidth usage and improve resiliency, which were essential requirements for them.

Finally, the new generation of their system also enabled them to release more often. Before, they were releasing new versions every quarter, but with the new system, they could release weekly when needed.

Summing up

In conclusion, SparkMeter’s adoption of Elixir and Nerves has led to many benefits for their mission-critical grid-management system.

Elixir was used to design elegant solutions across data processing, HTTP APIs, and within the embedded space. This unified development model led to a more productive and robust environment, with less complexity and fewer moving parts.

Additionally, the ability to control the entire connection from the ground to the cloud resulted in reduced bandwidth usage and improved resiliency. This fulfills essential requirements, given the diversity of conditions and locations the grid edge management unit may be deployed at.

The new system also allowed for more frequent releases, enabling SparkMeter to respond quickly to their business needs.


A Bridge Over a River Never Crossed

2023/01/01


When I first started my forever project, peer-to-peer file sync software using Interval Tree Clocks, I wanted to build it right.

That meant property-based testing everything, specifying the protocol fully, dealing with error conditions, and so on. Hell, I grabbed a copy of a TLA+ book to do it.

I started a document where I noted decisions and managed to write up a pretty nifty file-scanning library that could pick up and model file system changes over trees of files. The property tests are good enough to find out when things break due to Unicode oddities, and I felt pretty confident.

Then I got to writing the protocol for a peer-to-peer sync, with the matching state machines, and I got stuck. I couldn't come up with properties, and I had no idea what sort of spec I would even need. Only one sort of comparison kept popping into my mind: how do you invent the first telephone?

It's already challenging to invent any telephone (or I would assume so at least), even with the benefit of having existing infrastructure and networks, on top of having other existing telephones to test it with. But for the first telephone ever, you couldn't really start with a device that has both the mouthpiece and the earpiece in place, and then go "okay, now to make the second one" and have a conversation once they're both done.

In some ways you have to imagine starting with two half-telephones, with a distinct half for each side. You start with a part to speak into and a speaker part that goes on the other side and sends messages one way, and then sort of gradually build up a whole pair, I guess?

An actor portraying Alexander Graham Bell speaking into an early model of the telephone for a 1926 promotional film by AT&T, public domain. The phone is a simple conical part into which the actor is speaking, attached to a piece of wood, with no earpiece at all.

This was the sort of situation I was finding myself in for the protocol: I wanted to build everything correctly the first time around, but I had no damn idea how to wire up only one half to nothing just to figure out what shape exactly a whole exchange should have. I couldn't do it right all at once.

I had written protocols before, I had written production-grade distributed software before, there was prior art for this sort of thing, but I had never built this specific one.

This was like wanting to build a bridge, a solid one, to go over a river I had never crossed before. I could imagine the finished product’s general shape and purpose, I was eager to get to cross it. I had worked on some before, but not over this specific river. Hell, without having gone over the gap once end-to-end, I had no idea what the other side looked like.

I had also prototyped things before, and always wanted to make sure the prototype wouldn't end up in production. As it turns out, forcing myself to prototype things and make a very bad version of the software was the most effective way to get to the slightly less bad version that follows. And then a slightly better one, and another. This was iterative development winning over careful planning.

I’m at the point where I have a shoddy wooden bridge that I can cross over on. It’s real crappy software, it doesn’t deal with errors well (it’s safe and doesn’t break things, but it’s also unreliable and hangs more than I'd like), it’s not very fast, and it's downright unusable. But I now have a lot more infrastructure to work with. And once I’m through with the mess, I can maybe design a nicer form of it.

Building the bridge as you cross the river for the first time is a paralyzing thought, and despite all my wishes about it being done right on that initial attempt, it turns out it's a pretty good idea to make that first crossing as cheap and easy to tear down—and replace—as possible.

Saying "build a prototype and be ready to replace it" is a long known piece of conventional wisdom. The challenge is how crappy or solid should your prototype be? What is it that you're trying to figure out, and are you taking the adequate means for it?

There is a difference between a rough sketch with the right proportions and exploring from an entirely wrong perspective. Experience sort of lets you orient yourself early, and also lets you know which kind you have in your hands. I guess I'll find out soon if all the fucking around was done in the proper direction.

Funnily enough, traditional arch bridges were built by first having a wood framing on which to lay all the stones in a solid arch. That wood framing is called falsework, and is necessary until the arch is complete and can stand on its own. Only then is the falsework taken away. Without it, no such bridge would be left standing. That temporary structure, even if no trace is left of it at the end, is nevertheless critical to getting a functional bridge.

Falsework centering in the center arch of Monroe Street Bridge, Spokane, Washington, 1911. An elaborate wooden structure is supporting concrete until it can self-stand.

I always considered prototypes to be a necessary exploratory step, where you make mistakes, find key risks about your project, and de-risk a more final clean version. They were dirty drafts, meant to be thrown out until a clean plan could be drawn. I thought, if I had enough experience and knowledge, I could have that clean plan and just do it right.

Maybe I just needed to get over myself and consider my prototype to, in fact, be falsework: essential, unavoidable, fundamentally necessary, even if only temporary.


Cheatsheets and other 8 ExDoc features that improve the developer experience

ExDoc has a cool new feature, cheatsheets!

In this blog post, we’ll explain what that new feature is and the motivation behind it. We’ll also take the opportunity to highlight other ExDoc features that show how it has been evolving to make the documentation experience in Elixir better and better.

What are ExDoc cheatsheets and how they improve the documentation experience

ExDoc’s cheatsheets are Markdown files with the .cheatmd extension. You can see an example of how the Ecto project is using them.

Writing and reading cheatsheets is not exactly new to developers. What ExDoc brings to the table is the possibility of integrating cheatsheets alongside the rest of the documentation of an Elixir project, instead of hosting them in a different place.
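
Hooking a cheatsheet into a project's docs is just a matter of listing the file as an extra page; a minimal mix.exs sketch with a made-up file name:

# In mix.exs; "cheatsheets/getting-started.cheatmd" is a hypothetical path,
# and the .cheatmd extension is what makes ExDoc render it as a cheatsheet.
def project do
  [
    app: :my_app,
    version: "0.1.0",
    docs: [
      extras: ["README.md", "cheatsheets/getting-started.cheatmd"]
    ]
  ]
end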

Developers need different kinds of documentation at different times. When one is learning about a new library, a guide format works well. When one needs to know if a library can solve a specific problem, an API reference can be more appropriate. When someone wants to remember a couple of functions they already used from that library, a cheatsheet can be more practical.

Imagine if you had to go to a different place for every type of documentation you're looking for. That would make for a very fragmented experience, not only for readers of documentation but also for writers.

ExDoc cheatsheets represent one step further in the direction of making documentation in Elixir an even more comprehensive and integrated experience.

ExDoc cheatsheets are inspired by devhints.io from Rico Sta. Cruz, and were contributed by Paulo Valim and Yordis Prieto.

Eight features that show how ExDoc has improved developer experience over time

We added cheatsheets to ExDoc because we value developer experience and believe documentation is a core aspect of it.

Since the beginning, one of Elixir’s principles is that documentation should be a first-class citizen. What this idea means to us is that documentation should be easy to write and easy to read. ExDoc has been continuously evolving over the years, guided by this principle.

Here are some of the features added to ExDoc over the years that make reading and writing documentation in Elixir a joy.

Beautiful and usable design

As developers, we may not have the skill to make beautifully designed UIs. That doesn’t mean we don’t appreciate it. Here’s what documentation generated with ExDoc looked like almost ten years ago, with its original layout based on YARD:

Screenshot of the Phoenix v0.5.0 documentation generated with an early version of ExDoc

Here’s what it looks like today:

Screenshot of the Phoenix v1.6.15 documentation generated with current ExDoc

The evolution of ExDoc’s design helped documentation be more visually appealing and easier to read and navigate.

Link to the source code

Sometimes you're reading the documentation of a library, and you want to know more about the implementation of a function. Or you found something in the documentation that could be improved and want to help. In those situations, it's helpful to go from the documentation to the source code. ExDoc makes that dead easy. For every module, function, or page, ExDoc gives you a link that you can click to go directly to the project's source code on GitHub:

Short screencast of a user clicking on the "link to source code" button on the documentation for a function

Guides

One of the most common formats of library documentation is an API reference. But depending on your needs, that’s not the most approachable format. For example, it’s not optimal when you’re just getting started with a library or when you want to learn how to solve a specific problem using it. That’s why ExDoc allows writing other types of docs besides API references, like “Getting started” guides or How-tos.

Look at how Ecto’s documentation uses that, for example:

Screencast of a user exploring the guides in the Ecto documentation

Custom grouping of modules, functions, and pages in the sidebar

Sometimes your library has dozens of modules. Sometimes, one given module has a large API surface area. In those situations, showing the functions as a single large list may not be the most digestible format. For those cases, ExDoc allows modules, functions, or extra pages to be grouped in the sidebar in a way that makes more sense semantically.

Here's an example of how Ecto uses grouped functions for its Repo module:

Screenshot of the sidebar of the Ecto documentation, showing grouped functions in the `Ecto.Repo` module

Instead of listing the ~40 functions of Ecto.Repo as a single extensive list, it presents them grouped into five cohesive topics:

  • Query API
  • Schema API
  • Transaction API
  • Runtime API
  • User callbacks

The same functionality is available for modules and pages (guides, how-tos, and so on). Phoenix is a good example of how that’s used.
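
The grouping itself is plain configuration in the docs options; a sketch with made-up module and group names (the option names follow ExDoc's documentation):

# In mix.exs, under the :docs project option.
docs: [
  groups_for_modules: [
    "Query API": [MyApp.Repo, MyApp.Query],
    "Schema API": [MyApp.Schema]
  ],
  groups_for_extras: [
    "How-tos": ~r{guides/howtos/.*}
  ]
]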

Full-text search with autocomplete

Sometimes you don't know or don't remember the name of the function that you're looking for. For example, let's say you're looking for a function for dealing with file system directories.

Although there’s no function or module called “directory” in Elixir, when you type “directory” in Elixir’s documentation, it will return all the entries that have the word “directory” inside the documentation. It will even return entries with variations of the word “directory”, like “directories”, doing a fuzzy search.

Screenshot of the result of searching for "directory" in the Elixir documentation

The search bar also supports autocompletion for module and function names:

Screencast of a user typing the word "Enum" in the search bar of Elixir's documentation and letting it autocomplete the module. Then, the user types "Range" and both modules and functions show up.

The best part is that full-text search is fully implemented on the client-side, which means ExDoc pages can be fully hosted as static websites (for example on GitHub Pages).

Keyboard shortcuts to navigate to docs of other Hex packages

It’s common for an application to have dependencies. While coding, we usually need to read the documentation of more than one of those dependencies.

One solution is to keep a window open for the documentation of each dependency. However, ExDoc offers another option: a keyboard shortcut to search and go to another package documentation within the same window.

Here’s what it looks like:

Screencast of a user enabling the `g` shortcut to search through dependencies documentation and then using it to search for "phoenix_live" in the documentation for Nerves.

There are more keyboard shortcuts to help you navigate within and between documentation:

Screenshot of the keyboard shortcuts that you can enable in ExDoc

A version dropdown to switch to other versions

Keeping our application updated with the latest versions of all its dependencies can be challenging. So, it’s common to need to look at the documentation of an older version of a library we’re using. ExDoc makes it simple to do that.

When you access the documentation of a project, there’s a dropdown that you can use to select the version you’re looking for:

Screencast of a user typing the version dropdown under the application name in the "timex" documentation, revealing all the versions.

Livebook integration

Livebook is a web application for writing interactive and collaborative code notebooks in Elixir.

One of the ways Elixir developers have been using Livebook is for documentation. Because of its interactivity capabilities, it enables the reader to play with the code right inside the documentation, which makes it great for tutorials and augmenting the user experience.

With that in mind, ExDoc offers the possibility of integrating Livebook notebooks. That means one can host Livebook-based documentation together with the API reference.

Here’s an example of using Livebook inside ExDoc for writing a Usage Guide:

Screencast of a user navigating through the "req_sandbox" documentation, finding a Livebook, clicking "Run in Livebook", and using the Livebook that opens up on their local machine.

Bonus: Erlang support

EEP 48 proposed a standardized way for BEAM languages to store API documentation. This allows any BEAM language to read documentation generated by the others.

By leveraging that work, ExDoc can generate documentation for an Erlang project. For example, Telemetry is a library written in Erlang that has its documentation generated with ExDoc.

Screenshot of "telemetry" documentation generated with ExDoc

By using ExDoc to also generate documentation for Erlang-based projects, we can have more consistency in the user experience along the BEAM ecosystem. See the great rebar3_ex_doc plugin to get started.

Bonus: Doctests

When writing documentation, it’s helpful to offer code examples. For instance, here’s the documentation of the Enum.any?/1 function from Elixir’s standard library:

@doc """
Returns `true` if at least one element in `enumerable` is truthy.

When an element has a truthy value (neither `false` nor `nil`) iteration stops
immediately and `true` is returned. In all other cases `false` is returned.

## Examples

  iex> Enum.any?([false, false, false])
  false

  iex> Enum.any?([false, true, false])
  true

  iex> Enum.any?([])
  false

"""

To ensure examples do not get out of date, Elixir’s test framework ExUnit provides a feature called doctests. This allows developers to test the examples in their documentation. Doctests work by parsing out code samples starting with iex> from the documentation.
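
For completeness, wiring doctests into a test suite is a one-liner in an ExUnit case:

defmodule EnumDocsTest do
  use ExUnit.Case, async: true

  # Runs every `iex>` example found in Enum's documentation as a test.
  doctest Enum
end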

Although this is not a feature of ExDoc, it is an essential part of Elixir’s developer and documentation experience.

Wrap up

As we saw, ExDoc has evolved a lot throughout the years! As it continues to evolve into a more and more comprehensive documentation tool, we want to enable developers to keep investing more time writing the documentation itself instead of needing to spend time building custom documentation tools and websites. The best part is that all you need to do to leverage many of those features is to simply document your code using the @doc attribute!

Here's to a continuously improving documentation experience for the years to come.


The Law of Stretched [Cognitive] Systems

2022/12/15


One of the things I knew right when I started at my current job is that a lot of my work would be for "nothing." I'm saying this because I work (as Staff SRE) for an observability vendor, and engineers tend to operate under the idea that the work they're doing is going to make someone's life easier, lower barriers of entry, or just make things simpler by making them understandable.

While this is a worthy objective that I think we are helping with, I also hold the view that any such improvements would be used to expand the capacities of the system such that its burdens remain roughly the same.

I hold this view because of something called the Law of stretched systems:

Every system is stretched to operate at its capacity; as soon as there is some improvement, for example in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity.

Chances are you've noticed that the more RAM computers have, the more RAM browsers are going to take for tabs. The faster networks get, the larger the web pages that are served to you are going to be. If storage space is plentiful and cheap, movies and games and pictures are all going to get bigger and occupy that space.

If you've maintained APIs, you may have noticed that no matter what rate limit you put on an endpoint or feature, someone is going to ask for an order of magnitude more and find ways to use that capacity. You give people a budget of 10 alerts for their top-line features, and they'll think 100 would be nice so they have one per microservice. You give them 100 and they start thinking maybe 1,000 would be nice so each team can set 10 for the various features they maintain. You give them 1,000 and they start thinking 10,000 would be quite nice so each of their customers could get its own alert. Give more and maybe they can start reselling the feature themselves.

What is available will be used. Every system is stretched to operate at its capacity. Systems keep some slack capacity, but if they operate for long periods of time without this capacity being called upon, it likely gets optimized away.

Similar examples seem to also be present in larger systems—you can probably imagine a few around just-in-time supply chains given the last few years—but I'll avoid naming specifics as I'd be getting outside my own areas of expertise.

The law of stretched systems, I believe, applies equally well to most bottlenecks you can find in any socio-technical system. This would include your ability to keep up with what is going on with your system (be it social or technical) due to its complexity, intricacies, limited observability or understandability, or difficulties to enact change.

As far as I can tell, cognitive bandwidth and network bandwidth both display similar characteristics under that lens. That means that gaining extra capacity to understand what is going on, more clarity into the actions and functioning of the system is not likely to make your situation more comfortable in the long term; it's just going to expand how much you can accomplish while staying on that edge of understandability.

The pressures that brought the system's operating point to where it is are likely to stay in place, and will keep influencing the feedback loops it contains. Things are going to stay as fast-paced as they were, or grow more hectic, but with better tools to handle the newly added chaos, and that's it. And that is why the work I do is for "nothing": things aren't going to get cozier for the workers.

Gains in productivity over the last decades haven't really reduced the working hours of the average worker, but they have transformed how the work is done. I have no reason to believe that gains in understandability (or in factors affecting productivity) would change that. We're just gonna get more software, moving faster, doing more things, always bordering on running out of breath.

And once the system "stabilizes", once the new tools or methods become a given, when they fade into the background as normal everyday things, the system will start optimizing some of its newly found slack away. Its ability to keep going will become dependent on these tools, and were they to stop working, major disruptions should be expected as it adapts and readjusts (or collapses).

This has, I suppose, a weird side-effect in that it's an ever-ongoing ladder-pulling move. The incumbent tool-chain has greater and greater network effects, and any game-changing approach that does not mesh well with the rest of it (but could sustain even greater capacity had it received a similar investment) will never be able to get a foothold into an established ecosystem. Any newcomer in the landscape has to either pay an ever-increasing cost just to get started, or has to have such a different and distinct approach that it cancels out the accrued set of optimizations others have.

These costs may be financial, technological, in terms of education and training, or cognitive. Maybe this incidental ladder-pulling is partially behind organizations always wanting—or needing—ever more experienced employees even for entry-level positions.

There's something a bit disheartening about assuming that any move you make that improves how understandable a system is will also rapidly be used to move as close to the edge of chaos as possible. I however have so far not observed anything that would lead me to believe things are going to be any different this time around.


Get Rid of Your Old Database Migrations

Database migrations are great. I love to be able to change the shape of tables and move data around in a controlled way to avoid issues and downtime. However, lately I started to view migrations more like Git commits than like active pieces of code in my applications. In this post, I want to dig a bit deeper into the topic. I'll start with some context on database migrations, I'll expand on the Git commits analogy, and I'll show you what I've been doing instead.

I mostly write Elixir, so my examples will be in Elixir. If you're interested in the workflow I’m currently using for Elixir, jump to the last section of the post.

Cover image of a flock of birds flying, with a sunset sky in the background

What Are DB Migrations

DB migrations are pieces of code that you run to change the schema (and data, if desired) of your database. Wikipedia does a great job at explaining migrations, their history, and their use cases. I'll write a few words here for context, but go read that if you want to dig deeper.

Migrations have a few selling points over executing direct SQL commands.

First, database migrations usually keep information about the migrations themselves in a dedicated database table. To be precise, the framework that runs the migrations is the one storing information in that table. The framework uses this table to only run migrations that have not been run yet.

Another benefit is that you'll usually write migrations in a way that makes it possible to roll them back. "Rolling back" a migration means executing SQL statements that revert the changes done in that migration. Most database frameworks can infer the rollback steps from the migration definition if the migration is simple enough. This results in concise pieces of code to alter the database schema that can be executed "in both directions".

Let's see an example with Ecto, Elixir's database framework.

Example with Ecto

Let's say you have a users table with columns email and password. New requirements force you to add two new columns, first_name and last_name. To do that, you can write a migration that looks like this:

defmodule MyApp.AddNameToUsers do
  use Ecto.Migration

  def change do
    alter table("users") do
      add :first_name, :string
      add :last_name, :string
    end
  end
end

In this simple example, the framework is able to do what I mentioned above: you can define a single change/0 function with migration steps in it, and the framework is able to infer the corresponding rollback steps (in this case, removing the two new columns).
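
When a change is not trivially reversible, for example because it also moves data, Ecto lets you spell out both directions yourself with up/0 and down/0. A sketch with hypothetical column names:

defmodule MyApp.BackfillFullName do
  use Ecto.Migration

  def up do
    alter table("users") do
      add :full_name, :string
    end

    execute "UPDATE users SET full_name = first_name || ' ' || last_name"
  end

  def down do
    alter table("users") do
      remove :full_name
    end
  end
end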

When you run the migration (with mix ecto.migrate in this case), Ecto adds a row to the schema_migrations table:

version        | inserted_at
20221114232841 | 2022-11-15 21:27:50

Why Do We Keep Migrations Around?

Until recently, I had never worked on an application that did not keep all the migrations around. I'd always seen the priv/repo/migrations directory in Elixir applications full of files. I want to get one disclaimer out of the way: the many-migrations experience is a personal one, and I might be late to the party here. But hey, here's to hoping someone else is too and that this write-up is gonna help them out.

At one point, I started working on an older unfamiliar codebase. The experience made me think of two things.

The first one is about reading the complete, up-to-date database schema. I'd constantly fire up a Postgres GUI (I use TablePlus) to look at the structure of the database, since it was hard to navigate old migrations and piece together their end result.

The second one revolves around searching through code. Working on the new codebase involved a lot of searching all around the code to understand the structure of the application. Function names, modules, database columns, and what have you. However, database columns stuck with me: I'd always find a bunch of misleading search results in old migrations. For example, I'd see a few results for a column name that was created, then modified, and then dropped.

So I started wondering: why do we keep old migrations around? Don't get me wrong, I know why we write migrations in the first place. They're great, no doubts. But why not throw them away after they've done their job? How many times did you roll back more than one migration? I have never done that. It's hard to imagine rolling back many changes, especially when they involve not only the schema but also the data in the database itself. There must be a better way.

Analogy with Git Commits

I started to think of database migrations more like Git commits. You apply commits to get to the current state of your application. You apply database migrations to get to the current schema of your database. But after some time, Git commits become a tool for keeping track of history more than an active tool for moving back and forth between versions of the code. I'm now leaning towards treating database migrations the same way. I want them to stick around for a bit, and then "archive" them away. They're always going to be in the Git history, so I’m never really losing the source file, only the ability to apply the migrations.

So, how do we deal with this in practice?

Dumping and Loading

It turns out that this is something others have already thought about.

Database frameworks that provide migration functionality usually provide ways to dump and load a database schema. If they don't, fear not: major databases provide that themselves. In fact, in Elixir, Ecto's dump and load tasks only really act as proxies on top of tools provided by the underlying databases (such as pg_dump and psql for PostgreSQL).

The idea is always the same: to get the current state of the database, you'll run the dumping task. With Ecto and other frameworks, this produces an SQL file of instructions that you can feed to your database when you want to load the schema again.

Some frameworks provide a way to squash migrations instead. Django, for example, has the squashmigrations command. However, the concept is almost the same. Ruby on Rails's ActiveRecord framework has a unique approach: it can generate a Ruby schema file from migrations. It can also generate the SQL schema file mentioned above via the database, but the Ruby approach is interesting. Its power is limited, however, since the Ruby schema file might not be able to reconstruct the exact schema of the database. From the documentation:

While migrations may use execute to create database constructs that are not supported by the Ruby migration DSL, these constructs may not be able to be reconstituted by the schema dumper.

Dumping and loading the database schema works well in local development and testing, but not in production, right? You don't want to load a big old SQL file into a running production database. I think. Well, you don't really have to. Production databases tend to be reliable and (hopefully) backed up, so "restoring" a schema is not something you really do in production. It'd be analogous to re-running all the migrations: you just never do it.

Advantages and Disadvantages of Ditching Old Migrations

I find that dumping old migrations and loading an up-to-date SQL file has a few advantages.

  1. You get a complete view of the schema — The SQL file with the database schema now represents a complete look at the structure of the database. You can see all the tables, indexes, default values, constraints, and so on. Sometimes, you'll still need to create migrations and run them, but they're going to live in your codebase only temporarily, and it's only going to be a handful of them at a time, instead of tens (or hundreds) of files.

  2. Speed: A minor but not irrelevant advantage of this approach is that it speeds up resetting the database for local development and tests. Applying migrations can do many unnecessary operations in the database, such as creating tables only to delete them just after. When loading the database dump, you're really doing the smallest possible set of commands to get to the desired state.

However, it's not all great here. There are some disadvantages as well:

  1. Digging through Git — there are going to be situations in which you look at the migration table in your database and want to figure out which migration corresponds to a given row. This approach makes that use case slightly more annoying, because you'll have to dig through your Git history to find the original migration. Not a big deal in my opinion; I don't do this that often.

  2. Deploying without running migrations — make sure you deploy and run your migrations before dumping the schema and deleting them. With this approach, that's not something to take for granted. You might get a bit too comfortable dumping the current database schema and deleting migration files. You might end up in situations where you create a migration, run it locally, and then dump the schema and delete the migration, all without deploying. This would result in the migration never running in production.

Workflow in Elixir

Now for a small section specific to Elixir. Ecto provides the dump and load tasks mentioned above, mix ecto.dump and mix ecto.load respectively.

In my applications, I've been doing something like this:

  1. I updated the common Mix aliases for dealing with migrations to take dumping/loading into account. Those aliases look something like this now:

    defp aliases do
      [
        "ecto.setup": ["ecto.create", "ecto.load", "ecto.migrate"],
        "ecto.reset": ["ecto.drop", "ecto.setup"],
        test: [
          "ecto.create --quiet",
          "ecto.load --quiet --skip-if-loaded",
          "ecto.migrate --quiet",
          "test"
        ]
      ]
    end
    

    As you can see, the aliases now always run mix ecto.load before calling mix ecto.migrate. The --skip-if-loaded flag in the test alias ensures that the command is idempotent, that is, it can be run multiple times without changing the result.

  2. I added a Mix alias to "dump migrations", that is, dump the current database structure and delete all the current migration files. It looks like this:

    defp aliases do
      [
        dump_migrations: ["ecto.dump", &delete_migration_files/1]
      ]
    end
    
    defp delete_migration_files(_args) do
      # Match all files in the 21st century (year is 20xx).
      Enum.each(Path.wildcard("priv/repo/migrations/20*.exs"), fn migration_file ->
        File.rm!(migration_file)
        Mix.shell().info([:bright, "Deleted: ", :reset, :red, migration_file])
      end)
    end
    

    The path wildcard could be improved, or you could have logic that reads the files and checks that they're migrations. However, this does a good-enough job.
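
    For example, a slightly stricter variant (just a sketch, not what I actually run) could match every .exs file in the migrations directory but only delete the ones whose contents mention use Ecto.Migration:

    defp delete_migration_files(_args) do
      "priv/repo/migrations/*.exs"
      |> Path.wildcard()
      # Only treat a file as a migration if it actually uses Ecto.Migration.
      |> Enum.filter(fn path -> File.read!(path) =~ "use Ecto.Migration" end)
      |> Enum.each(fn migration_file ->
        File.rm!(migration_file)
        Mix.shell().info([:bright, "Deleted: ", :reset, :red, migration_file])
      end)
    end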

Conclusions

If you start to look at database migrations as analogous to Git commits, you can start to treat them that way. We saw how to use the "dumping" and "loading" functionality provided by many databases and database frameworks. We saw the advantages and disadvantages of this approach. Finally, I showed you the approach I use in Elixir.


Hiding Theory in Practice

2022/11/23

I'm a self-labeled incident nerd. I very much enjoy reading books and papers about them, I hang out with other incident nerds, and I always look for ways to connect the theory I learn about with the events I see at work and in everyday life. As it happens, studying incidents tends to put you in close proximity with many systems that are in various states of failure, which also tends to elicit all sorts of negative reactions from the people around them.

This sensitive nature makes it perhaps unsurprising that incident investigation and review facilitation come with a large number of concepts and practices you are told to avoid because they are considered counterproductive. A tricky question I want to discuss in this post is how to deal with them when you see them come up.

A small sample of these undesirable concepts includes things such as:

  • Root Cause: I've covered this one in Errors are constructed, not discovered. To put it briefly, focusing on root causes tends to narrow the investigation in a way that ignores a rich tapestry of contributing factors.
  • Normative Judgments: these often show up when saying someone should have done something that they did not do. They carry the risk of siding with the existing procedure as correct and applicable by default, and tend to blame and demand change from operators more than from their tools and support structure.
  • Counterfactuals: these are about things that did not happen: "had we been warned earlier, none of this would have cascaded." This is a bit like preparing for yesterday's battle. It's very often coupled with normative judgments ("the operator failed to do X, which led to ...").
  • Human Error: generally not a useful concept, at least not in the way you'd think. This is best covered in "Those found responsible have been sacked" by Richard Cook or The Field Guide to Understanding 'Human Error'; a finding of "human error" tends to be the sign of an organization protecting itself, or of a failed investigation. Generally the advice is that if you find human error, that's where the investigation begins, not where it ends.
  • Blame: psychological safety is generally hard to maintain if people feel that they are going to be punished for doing their best and trying to help. You can only get good information if people trust that they can reveal it. Blameless processes, or rather blame-aware reviews, aim to foster this safety.

There are more concepts than these, and each could be a post on its own. I've chosen this list because each of them is an extremely common reaction, something so intuitive it will feel self-evident to the people using it. Avoiding them requires a kind of unlearning: you first have to remove the usual framing you'd use to interpret events, and then gradually learn to re-construct them differently.

This is challenging, and while this is something you and other self-labeled incident nerds can extensively discuss and debate as peers, it is not something you can reasonably expect others to go through in a natural post-incident setting. Most of the people with whom you will interact will never care about the theory as much as you do, almost by definition since you're likely to represent expertise for the whole organization on these topics.

In short, you need to figure out how to act in a way that is coherent with the theory you hold as an inspiration, while being flexible enough not to cause friction with others or to require them to know everything you know for your own work to be effective.

As an investigator or facilitator, let's imagine someone who's a technical expert on the team comes to you during the investigation (before the review) and says "I don't get why the on-call engineer couldn't find the root cause right away since it was so obvious. All they had to do was follow the runbook and everything would have been fine!"

There are going to be times when it's okay to let comments like these go, to avoid doing a deep dive at every opportunity. In the context of a review based on a thematic analysis, the themes you are focusing on should help direct where you put your energy, and guide you in figuring out whether emotionally-charged comments are relevant or not.

But let's assume they are relevant to your themes, or that you're still trying to figure them out. Here are two reactions you can have, which may come up as easy solutions but are not very constructive:

  • You may want to police their intervention: since you care for blame-awareness and psychological safety, you may want to nip this behavior in the bud and let them know about the issues around blame, normativeness and counterfactuals.
  • You may also want to ignore that statement, drop it from your notes, and make sure it does not come up in any written form. Just pretend it never came up.

In either case, if behavior that clashes with theoretical ideals is not welcomed, the end result is that you lose precious data, either by omission or by making participants feel less comfortable talking to you.

Strong emotional reactions are as good a source of data for your work as any architecture diagram. They can highlight important and significant dynamics in your organization. Ignoring them means ignoring potentially useful data, and may damage the trust people put in you.

The approach I find more useful is one where the theoretical points you know and appreciate guide your actions. That statement is full of amazing hooks to grab onto:

  • The fact that they believe it is obvious, while it was not to the on-call engineer, hints at a clash between their mental models, which is a great opportunity to compare and contrast them. Diverging perspectives like that are worth digging into because they can reveal a lot.
  • The thought that the runbook is complete and adequate is worth exploring: was the on-call engineer aware of it? Are runbooks considered trustworthy by all? Were they entertaining hypotheses or observing signals that pointed in another direction? Is there any missing context?
  • That counterfactual point ("everything would have been fine!") is a good call-out for perspective. Does it mean next time we need to change nothing? Can we look into challenges around the current situation to help shape decision-making in the future?
  • Is this frustrated reaction pointing at patterns the engineer finds annoying? Does it hint at conflicts or lack of trust across teams?
  • Zooming out from the "root cause" with a newcomer's eyes can be a great way to get insights into a broader context: is this failure mechanism always easily identifiable? Are there false positives to care for? Has it changed recently? What's the broader context around this component? You can discuss "contributing factors" even when using the words "root cause" with people.

None of these requires interrupting or policing what the interviewee is telling you. The incident investigation itself becomes a place where various viewpoints are shared. The review should then be a place where everyone can broaden their understanding, and can form their own insights about how the socio-technical system works. Welcome the data, use it as a foothold for more discoveries.

If you do bring that testimony to the review (on top of having used it to inform the investigation), make sure you frame it in a way that feels safe and unsurprising for all participants involved. Respect the trust they've put in you.

How to do this, it turns out, is not something about which I have seen a lot of easily applicable theory. It's just hard. If I had to guess, I'd say there's a huge part of it that is tacit knowledge, which means you probably shouldn't wait on theory to learn how to do it. It's way too contextual and specific to your situation. If this is indeed the case, theory can be a fuzzy guideline for you at most, not a clear instruction set.

This is how I believe theory is most applicable: as a hidden guide you use to choose which paths to take, which actions to prefer. There's a huge gap between the idealized higher level models and the mess (or richness) of the real world situations you'll be in. Navigating that gap is a skill you'll develop over time. Theory does not need to be complete to provide practical insights for problem resolution. It is more useful as a personal north star than as a map. Others don't need to see it, and you can succeed without it.

Thanks to Clint Byrum for reviewing this text.


Miscellany or Covert Procrastination?

I was thinking about writing a new blog post, but then I suddenly remembered that I'd have to create a Markdown file from scratch for it, which reminded me that I had left that feature in Lambdapad half-finished; and when I opened the editor, I remembered that I hadn't updated my Elixir version... has that ever happened to you?


Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.