Show All Telemetry Events in Erlang and Elixir

Telemetry is an Erlang library for dynamically dispatching events to event handlers. Many popular Erlang and Elixir packages use the Telemetry library to emit events. Telemetry event data typically ends up in logs or metric databases like Prometheus.

Events are identified by name. You can register a function as an event handler and get invoked when events with a specific name occur. A name or list of names must be specified when registering a handler using the telemetry:attach/4 and telemetry:attach_many/4 functions.
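For example, attaching a handler to a single known event looks like this in Elixir. This is a minimal sketch: the event name `[:my_app, :request, :stop]` and the handler id are invented for illustration, and the snippet assumes the telemetry package is available (fetched here via Mix.install):

```elixir
# Sketch only: handler id and event name are made up for illustration.
Mix.install([{:telemetry, "~> 1.2"}])

parent = self()

:ok =
  :telemetry.attach(
    "print-one-event",                 # unique handler id
    [:my_app, :request, :stop],        # the event name we subscribe to
    fn name, measurements, metadata, _config ->
      # Handlers run synchronously in the process that calls execute/3
      send(parent, {:telemetry_event, name, measurements, metadata})
    end,
    nil
  )

# Emitting the event invokes every handler attached to this exact name
:telemetry.execute([:my_app, :request, :stop], %{duration: 120}, %{path: "/users"})

receive do
  {:telemetry_event, name, measurements, metadata} ->
    IO.inspect({name, measurements, metadata})
after
  1_000 -> raise "expected a telemetry event"
end
```

Note that the handler only fires for the exact name it was attached to, which is precisely the limitation this article works around.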

The Problem

Often it’s not obvious what telemetry events are being generated by a dependency, or even an entire system. Sometimes it is helpful to see all telemetry events the system generates. Unfortunately there is no way to subscribe to ALL events with the telemetry API. We need to know the names of telemetry events before we can register to receive them. And there is no way to get the names of all possible events. Telemetry only invokes handlers registered for a specific event name when emitting an event.

The Solution

Erlang’s tracing features can be used to capture all events emitted. All events are emitted via the telemetry:execute/3 and telemetry:span/3 functions. These two function calls can be traced to get the same event data the telemetry handlers would be invoked with.

To trace these function calls several functions in the dbg module will be used. Tracing all function calls requires invoking these functions with various arguments so it’s easier if we put the tracing code in a function. The function can be named start and a corresponding stop function can be defined to stop tracing telemetry events. It would be nice to have the functions available in the Elixir and Erlang shells when developing locally. It’s pretty easy to do this for both Erlang and Elixir.

For Erlang, copy and paste the following into your user_default module. Note that you’ll need this module compiled before it will be loaded by the Erlang shell:

% telemetry_attach_all/0 prints out all telemetry events received
telemetry_attach_all() ->
  telemetry_attach_all(fun(Name, MetaOrMeasure, MetaOrFun) ->
      % Print out telemetry info
      io:format("Telemetry event: ~w~nwith ~p and ~p~n", [Name, MetaOrMeasure, MetaOrFun])
    end).

% telemetry_attach_all/1 allows you to specify a handler function that
% will be invoked with the same three arguments that the
% `telemetry:execute/3` and `telemetry:span/3` functions were invoked
% with.
telemetry_attach_all(Function) ->
  % Start the dbg tracer

  % Create tracer process with a function that pattern matches out the three arguments the telemetry calls are made with.
  dbg:tracer(process, {fun({_, _, _, {_Mod, _Fun, [Name, MetaOrMeasure, MetaOrFun]}}, _State) ->
      Function(Name, MetaOrMeasure, MetaOrFun)
    end, undefined}),

  % Trace all processes
  dbg:p(all, c),

  % Trace calls to the functions used to emit telemetry events
  dbg:tp(telemetry, execute, 3, []),
  dbg:tp(telemetry, span, 3, []).

% telemetry_stop/0 stops tracing telemetry events
telemetry_stop() ->
  dbg:stop().

For Elixir, define a module in your .iex.exs file by copying and pasting the code below:

defmodule TelemetryHelper do
  @moduledoc """
  Helper functions for seeing all telemetry events.
  Only for use in development.
  """

  @doc """
  attach_all/0 prints out all telemetry events received by default.
  Optionally, you can specify a handler function that will be invoked
  with the same three arguments that the `:telemetry.execute/3` and
  `:telemetry.span/3` functions were invoked with.
  """
  def attach_all(function \\ &default_handler_fn/3) do
    # Start the tracer.
    # Create tracer process with a function that pattern matches out the
    # three arguments the telemetry calls are made with.
    :dbg.tracer(
      :process,
      {fn
         {_, _, _, {_mod, :execute, [name, measurement, metadata]}}, _state ->
           function.(name, metadata, measurement)

         {_, _, _, {_mod, :span, [name, metadata, span_fun]}}, _state ->
           function.(name, metadata, span_fun)
       end, nil}
    )

    # Trace all processes
    :dbg.p(:all, :c)

    # Trace calls to the functions used to emit telemetry events
    :dbg.tp(:telemetry, :execute, 3, [])
    :dbg.tp(:telemetry, :span, 3, [])
  end

  def stop do
    # Stop tracer
    :dbg.stop()
  end

  defp default_handler_fn(name, metadata, measure_or_fun) do
    # Print out telemetry info
    IO.puts("Telemetry event: #{inspect(name)}\nwith #{inspect(measure_or_fun)} and #{inspect(metadata)}")
  end
end

Once you have added the code above, start a shell with your application running. Then invoke the attach_all function.

In Erlang:

> telemetry_attach_all()

In Elixir:

> TelemetryHelper.attach_all()

Then interact with the system and you’ll see all trace events printed out. When finished, stop the tracing:

In Erlang:

> telemetry_stop()

In Elixir:

> TelemetryHelper.stop()


The downside to this approach is that it relies on tracing so it’s not suitable for use in application code. But it’s great for use in development. Tracing is an easy way to grab data that is passing through any system on the Erlang virtual machine. It’s an invaluable tool that can speed up debugging in many different situations.


Strong arrows: a new approach to gradual typing

This article expands on the topic of gradual set-theoretic typing discussed during my keynote at ElixirConf US 2023.

There is an on-going effort to research and develop a type system for Elixir, led by Giuseppe Castagna, CNRS Senior Researcher, and undertaken by Guillaume Duboc as part of his PhD studies.

In this article, we will discuss how the proposed type system will tackle gradual typing and how it relates to set-theoretic types, with the goal of providing an introduction to the ideas presented in our paper.

Set-theoretic types

The type system we are currently researching and developing for Elixir is based on set-theoretic types, which is to say its operations are based on the fundamental set operations of union, intersection, and negation.

For example, the atom :ok is a value in Elixir that can be represented by the type :ok. All atoms in Elixir are represented by themselves in the type system. A function that returns either :ok or :error is said to return :ok or :error, where the or operator represents the union.

The types :ok and :error are contained by the type atom(), which is an infinite set representing all atoms. The union of the types :ok and atom() can be written as :ok or atom(), and is equivalent to atom() (as :ok is a subset of atom()). The intersection of the types :ok and atom() can be written as :ok and atom(), and is equivalent to :ok.
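As a loose intuition for these operations on finite types, Elixir’s own MapSet behaves the same way. This is only an analogy of ours, not the type system itself: types like atom() are infinite sets and cannot be materialized as values.

```elixir
# Finite analogue of the union/intersection examples above, using MapSet.
ok = MapSet.new([:ok])
ok_or_error = MapSet.new([:ok, :error])

# :ok or (:ok or :error) is equivalent to (:ok or :error), as :ok is a subset
MapSet.equal?(MapSet.union(ok, ok_or_error), ok_or_error)   # => true

# :ok and (:ok or :error) is equivalent to :ok
MapSet.equal?(MapSet.intersection(ok, ok_or_error), ok)     # => true
```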

Similarly, integer() is another infinite set representing all integers. integer() or atom() is the union of all integers and atoms. The intersection integer() and atom() is an empty set, which we call none(). The union of all types that exist in Elixir is called term().

The beauty of set-theoretic types is that we can model many interesting properties found in Elixir programs on top of those fundamental set operations, which in turn we hope to make typing in Elixir both more expressive and accessible. Let’s see an example of how a type system feature, called bounded quantification (or bounded polymorphism), can be implemented with set-theoretic types.

Upper and lower bounds

The identity function is a function that receives an argument and returns it as is. In Java, it would be written as follows:

static <T> T identity(T arg) {
    return arg;
}
In TypeScript:

function identity<T>(arg: T): T {
  return arg;
}
Or in Haskell:

id :: a -> a
id arg = arg

In all of the examples above, we say the function receives an argument of type variable T (or type variable a in Haskell’s case) and returns a value of the same type T. We call this parametric polymorphism, because the function parameter - its argument - can take many (poly) shapes (morphs). In Elixir, we could then support:

$ a -> a
def identity(arg), do: arg

Sometimes we may want to further constrain those type variables. As an example, let’s constrain the identity function in Java to numbers:

static <T extends Number> T identity(T arg) {
    return arg;
}
Or in TypeScript:

function identity<T extends number>(arg: T): T {
    return arg;
}
In Haskell, we can constrain to a typeclass, such as Ord:

id :: Ord a => a -> a
id x = x

In other words, these functions can accept any type as long as they fulfill a given constraint. This in turn is called bounded polymorphism, because we are putting bounds on the types we can receive.

With all that said, how can we implement bounded polymorphism in set-theoretic types? Imagine we have a type variable a, how can we ensure it is bounded or constrained to another type?

With set-theoretic types, this operation is an intersection. If you have a and atom(), a can be the type :foo. a can also be the type atom(), which represents all atom types, but a cannot be integer(), as integer() and atom() will return an empty set. In other words, there is no need to introduce a new semantic construct, as intersections can be used to place upper bounds on type variables! Therefore, we could restrict Elixir’s identity function to numbers like this:

$ a and number() -> a and number()
def identity(arg), do: arg

Of course, we can provide syntax sugar for those constraints:

$ a -> a when a: number()
def identity(arg), do: arg

But at the end of the day it will simply expand to intersections. The important bit is that, at the semantic level, there is no need for additional constructs and representations.

Note: for the type-curious readers, set-theoretic types implement a limited form of bounded quantification à la Kernel Fun. In a nutshell, it means we can only compare functions if they have the same bounds. For example, our type system states a -> a when a: integer() or boolean() is not a subtype of a -> a when a: integer().

We also get lower bounds for free. If intersections allow us to place an upper bound on a type variable, a union is equivalent to a lower bound as it specifies the type variable will always be augmented by the union-ed type. For example, a or atom() says the result will always include atoms plus whatever else specified by a (which may be an atom, atom() itself, or a completely disjoint type such as integer()).

Elixir protocols, a construct equivalent to Haskell typeclasses or Java interfaces, are another example of functionality that can be modelled and composed with set-theoretic types without additional semantics. The exact mechanism to do so is left as an exercise to the reader (or the topic of a future blog post).

Enter gradual typing

Elixir is a functional dynamic programming language. Existing Elixir programs are untyped, which means that a type system needs mechanisms to interface existing Elixir code with future statically typed Elixir code. We can achieve this with gradual typing.

A gradual type system is a type system that defines a dynamic() type. It is sometimes written as ? and sometimes known as the any type (but I prefer to avoid any because it is too short and too lax in languages like TypeScript).

In Elixir, the dynamic() type means the type is only known at runtime, effectively disabling static checks for that type. More interestingly, we can also place upper and lower bounds on the dynamic type using set operations. As we will soon learn, this will reveal interesting properties about our type system.

It is often said that gradual typing is the best of both worlds. Perhaps ironically, that’s true and false at the same time. If you use a gradual type system but you never use the dynamic() type, then it behaves exactly like a static type system. However, the more you use the dynamic() type, the fewer guarantees the type system will give you, and the more the dynamic() type propagates through the system. Therefore, it is in our interest to reduce the occurrences of the dynamic() type as much as possible, and that’s what we set out to do.

Interfacing static and dynamic code: the trouble with dynamic()

Let’s go back to our constrained identity function that accepts only numbers:

$ a -> a when a: number()
def identity(arg), do: arg

Now imagine that we have some untyped code that calls this function:

def debug(arg) do
  "we got: " <> identity(arg)
end
Since debug/1 is untyped, its argument will receive the type dynamic().

debug/1 proceeds to call identity with an argument and then uses the string concatenation operator (<>) to concatenate "we got: " to the result of identity(arg). Since identity/1 is meant to return a number and string concatenation requires two strings as operands, there is a typing error in this program. On the other hand, if you call debug("hello") at runtime, the code will work and won’t raise any exceptions.

In other words, the static typing version of the program and its runtime execution do not match in behaviour. So how do we tackle this?
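To make the runtime side concrete, here is an untyped sketch of both functions (the module name is ours, for illustration only):

```elixir
defmodule UntypedExample do
  # Untyped identity: at runtime nothing restricts the argument to numbers
  def identity(arg), do: arg

  def debug(arg) do
    "we got: " <> identity(arg)
  end
end

# Succeeds at runtime, despite the declared types saying identity returns a number
UntypedExample.debug("hello")  # => "we got: hello"
```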

One option is to say that’s all behaving as expected. If debug/1 is untyped, its arg has the dynamic() type. To type check this program, we specify that identity(dynamic()) returns the dynamic() type, the concatenation of a string with dynamic() also returns dynamic(), and consequently debug/1 gets the type dynamic() -> dynamic(), with no type errors emitted.

The trouble is: this is not a very useful choice. Once dynamic() enters the system, it spreads everywhere: we perform fewer checks, effectively discarding the information that identity/1 returns a number, and the overall type system becomes less useful.

Another option would be for us to say: once we call a statically typed function with dynamic(), we will ignore the dynamic() type. If the function says it returns a number(), then it will surely be a number! In this version, identity(dynamic()) returns number() and the type system will catch a type error when concatenating a string with a number.

This is similar to the approach taken by TypeScript. This means we can perform further static checks, but it also means we can call debug("foobar") and that will return the string "we got: foobar"! But how can that be possible when the type system told us that identity returns a number()? This can lead to confusion and surprising results at runtime. We say this system is unsound, because the types at runtime do not match our compile-time types.

None of our solutions so far attempted to match the static and runtime behaviors, but rather, they picked one in favor of the other.

But don’t despair, there is yet another option. We could introduce runtime checks whenever we cross the “dynamic <-> static” boundaries. In this case, we could say identity(dynamic()) returns a number(), but we will introduce a runtime check into the code to guarantee that’s the case. This means we get static checks, we ensure the value is correct at runtime, with the cost of introducing additional checks at runtime. Unfortunately, those checks may affect performance, depending on the complexity of the data structure and on how many times we cross the “dynamic <-> static” boundary.

Note: there is recent research in using the runtime checks introduced by a gradual type system to provide compiler optimizations. Some of these techniques are already leveraged by the Erlang VM to optimize code based on patterns and guards.

To summarize, we have three options:

  • Calling static code from dynamic code returns dynamic(), dropping the opportunity of further static typing checks (this is sound)

  • Calling static code from dynamic code returns the static types, potentially leading to mismatched types at runtime (this is unsound)

  • Calling static code from dynamic code returns the static types with additional runtime checks, unifying both behaviours but potentially impacting performance (this is sound)

Introducing strong arrows

I have always said that Elixir, thanks to Erlang, is an assertive language. For example, if our identity function is restricted to only numbers, in practice we would most likely write it as:

$ a -> a when a: number()
def identity(arg) when is_number(arg), do: arg

In the example above, identity will fail if given any value that is not a number. We often rely on pattern matching and guards and, in turn, they help us assert on the types we are working with. Not only that, Erlang’s JIT compiler already relies on this information to perform optimizations whenever possible.

We also say Elixir is strongly typed because its functions and operators avoid implicit type conversions. The following functions also fail when their input does not match their type:

$ binary() -> binary()
def debug(string), do: "we got: " <> string

$ (integer() -> integer()) and (float() -> float())
def increment(number), do: number + 1

<> only accepts binaries as arguments and will raise otherwise. + only accepts numbers (integers or floats) and will raise otherwise. + does not perform implicit conversions of non-numeric types, such as strings to number, as we can see next:

iex(1)> increment(1)
2
iex(2)> increment(13.0)
14.0
iex(3)> increment("foobar")
** (ArithmeticError) bad argument in arithmetic expression: "foobar" + 1

In other words, Elixir’s runtime consistently checks values and their types at runtime. If increment fails when given something other than a number, then it will fail when the dynamic() type does not match its input at runtime. This guarantees increment returns its declared type and therefore we do not need to introduce runtime type checks when calling said function from untyped code.

When we look at the identity, debug, and increment functions above, we - as developers - can state that these functions raise when given a value that does not match their input. However, how can we generalize this property so it is computed by the type system itself? To do so, we introduce a new concept called strong arrows, which relies on set-theoretical types to derive this property.

The idea goes as follows: a strong arrow is a function that can be statically proven to error when evaluated on values outside of its input types (i.e. its domain). For example, in our increment function, if we pass a string() as argument, it won’t type check, because string() + integer() is not a valid operation. Thanks to set-theoretic types, we can compute all values outside of the domain by computing the negation of a set. Given increment/1 will fail for all types which are not number(), the function is strong.

By applying this rule to all typed functions, we will know which functions are strong and which ones are not. If a function is strong, the type system knows that calling it with a dynamic() type will always evaluate to its return type! Therefore we say the return type of increment(dynamic()) is number(), which is sound and does not need further runtime checks!

Going back to our debug function, when used with a guarded identity, the type system will be able to emit warnings at compile time and errors at runtime, without introducing any additional runtime checks:

$ a -> a when a: number()
def identity(arg) when is_number(arg), do: arg

def debug(arg) do
  "we got: " <> identity(arg)
end
However, if the identity function is not strong, then we must fallback to one of the strategies in the previous section.

Another powerful property of strong arrows is that they are composable. Let’s pick an example from the paper:

$ number(), number() -> number()
def subtract(a, b) do
  a + negate(b)
end
$ number() -> number()
def negate(int), do: -int

In the example above, negate/1’s type is a strong arrow, as it raises for any input outside of its domain. subtract/2’s type is also a strong arrow, because both + and our own negate are strong arrows too. This is an important capability as it limits how dynamic() types spread throughout the system.

Errata: my presentation used the type integer() instead of number() for the example above. However, that was a mistake in the slide. Giving the type integer(), integer() -> integer() to subtract and integer() -> integer() to negate does not make subtract a strong arrow. Can you tell why?

Luckily, other gradually typed languages can also leverage strong arrows. However, the more polymorphic a language and its functions are, the more unlikely it is to conclude that a given function is strong. For example, in other gradually typed languages such as Python or Ruby, the + operator is extensible and the user can define custom types where the operation is valid. In TypeScript, "foobar" + 1 is also a valid operation, which expands the function domain. In both cases, an increment function restricted to numbers would not have a strong arrow type, as the operator won’t fail for all types outside of number(). Therefore, to remain sound, they must either restrict the operands with further runtime checks or return dynamic() (reducing the number of compile-time checks).

There is one last scenario to consider, which I did not include during my keynote for brevity. Take this function:

$ integer() -> :ok
def receives_integer_and_returns_ok(_arg), do: :ok

The function above can receive any type and return :ok. Is its type a strong arrow? Well, according to our definition, it is not. If we negate its input, type checking does not fail; the function still returns :ok.

However, given the return type is always the same, it should be a strong arrow! To account for this, let’s amend and rephrase our definition of strong arrows: we negate the domain (i.e. the inputs) of a function and then type check it. If the function returns none() (i.e. it does not type check) or a type which is a subset of its codomain (i.e. its output), then it is a strong arrow.
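A quick sketch of this corner case (the module name is ours, for illustration only):

```elixir
defmodule Always do
  # Declared as integer() -> :ok, but the body never inspects its argument,
  # so evaluating it outside its domain still yields :ok
  def receives_integer_and_returns_ok(_arg), do: :ok
end

Always.receives_integer_and_returns_ok(1)      # => :ok
Always.receives_integer_and_returns_ok("oops") # => :ok, a subset of the codomain
```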

Gradual typing and false positives

There is one last scenario we must take into consideration when interfacing dynamic and static code. Imagine the following code:

def increment_and_remainder(numerator, denominator) do
  rem(numerator, increment(denominator))
end
$ (integer() -> integer()) and (float() -> float())
def increment(number), do: number + 1

The increment_and_remainder/2 function is untyped, therefore both of its arguments receive type dynamic(). The function then computes the remainder of the numerator by the denominator incremented by one. For this example, let’s assume all uses of increment_and_remainder/2 in our program pass two integers as arguments.

Given increment/1 has a strong arrow type, according to our definition, increment(dynamic()) will return integer() or float() (also known as number()). Here lies the issue: if increment(dynamic()) returns integer() or float(), the program above won’t type check because rem/2 does not accept floats.
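To make the tension concrete: at runtime the program is perfectly fine as long as callers pass integers (the module name is ours, for illustration only):

```elixir
defmodule Remainder do
  def increment(number), do: number + 1

  def increment_and_remainder(numerator, denominator) do
    rem(numerator, increment(denominator))
  end
end

# With integer arguments, increment/1 returns an integer and rem/2 succeeds
Remainder.increment_and_remainder(10, 2)  # => 1, since rem(10, 3) == 1
```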

When faced with this problem, there are two possible reactions:

  1. It is correct for the function to not type check given increment may return a float

  2. It is incorrect for the function to not type check because the error it describes never occurs in the codebase

Another interesting property of gradual set-theoretic types is that we can also place upper bounds on the dynamic() type. If a function returns number(), it means the caller needs to handle both integer() and float(). However, if a function returns dynamic() and number(), it means the type is defined at runtime, but it must still verify it is one of integer() or float() at compile time.

Therefore, rem/2 will type check if its second argument has the type dynamic() and number(), as there is one type at runtime (integer()) that satisfies type checking. On the other hand, if you attempt to use the string concatenation operator (<>) on dynamic() and number(), then there is no acceptable runtime type and you’d still get a typing violation!

Going back to strong arrows, there are two possible return types from a strong arrow:

  1. A strong arrow, when presented with a dynamic type, returns its codomain

  2. A strong arrow, when presented with a dynamic type, returns the intersection of the codomain with the dynamic() type

The second option opens up the possibility for existing codebases to gradually migrate to static types without dealing with false positives. Coming from a dynamic background, false positives can be seen as noisy or as an indication that static types are not worth the trouble. With strong arrows and gradual set-theoretic types, we will be able to explore different trade-offs on mixed codebases. Which of the two choices above we will adopt as a default and how to customize them is yet to be decided. It will depend on the community feedback as we experiment and integrate the type system.

Erlang and Elixir developers who use Dialyzer will be familiar with these trade-offs, as the second option mirrors Dialyzer’s behaviour of no false positives. The difference here is that our semantics are integrated into a complete type system. If no type signature is present, the dynamic() type is used, and we will leverage the techniques described here to interface dynamic and static code. If a function has a type signature, and no dynamic() type is present, then it will behave as statically typed code when called with statically typed arguments. Migrating to static types will naturally reduce the interaction points between dynamic and static code, removing the reliance on the dynamic() type.


Set-theoretic types allow us to express many typing features based on set operations of union, intersection, and negation.

In particular, we have been exploring a gradual set-theoretic type system for Elixir, paying special attention to how the type system will integrate with existing codebases and how it can best leverage the semantics of the Erlang Virtual Machine. The type system will also perform limited inference based on patterns and guards (as described in the paper), which - in addition to strong arrows - we hope will bring some of the benefits of static typing to codebases without changing a single line of code.

While our efforts have officially moved from research into development, and we have outlined an implementation plan, we haven’t yet fully implemented nor assessed the usability of set-theoretic types in existing Elixir codebases, either large or small. There is much to implement and validate, and we don’t rule out the possibility of finding unforeseen deal breakers that could send us back to square one. Yet we are pleased and cautiously excited about the new developments so far.

The development of Elixir’s type system is sponsored by Fresha (they are hiring!), Starfish* (they are hiring!), and Dashbit.


Erlang/OTP 26.1 Release

OTP 26.1

Erlang/OTP 26.1 is the first maintenance patch package for OTP 26, containing mostly bug fixes as well as a few improvements.

For details about bugfixes and potential incompatibilities see the Erlang 26.1 README.

The Erlang/OTP source can also be found at GitHub on the official Erlang repository.

Download links for this and previous versions are found here.


Language Design: When Less is More

Yesterday I was invited to do an interview with the man behind Erlang Punch, Mathieu Kerjouan. Mathieu is a truly great programmer, an absolutely excellent interviewer, and it was a lot of fun to get to do something task-focused with him like this. He asked some pretty broad questions about Erlang as a language, as an ecosystem, and its role in my professional development and journey as a programmer.

He asked what I like and what I don’t like about the language, and as anyone who knows me well as a programmer may imagine, I gave a rather extensive answer. In the course of giving my answer the main theme that emerged was that Erlang’s greatest trait as a language is its simplicity and consistency, and the worst parts are where this simplicity and consistency is violated. Most of the complexity of the system is actually in OTP as a system, rather than Erlang itself as a language, and this seems to be a profoundly beneficial tradeoff.

One point of inconsistency I mentioned as a dislike was the shadowing of variable names declared in an outer scope in the heads of lambda functions, and another was the introduction of the alien ?= operator instead of relying on functional constructs that are far more flexible (such as adding a pipeline library module to the standard library). I went into some detail in my answer about the exact reasons why these are examples of stupid warts in the language and the history behind them so I won’t recount them here, but it is sufficient to say that there is a cardinal sin being committed in these two cases: the crime of inconsistency.

To expand on that point a bit, I would like to illustrate briefly how it would actually be possible to reduce Erlang’s syntax to bring even further consistency to it. I’m not saying this is something I’m going to write an EEP about any time soon (the juice just isn’t worth the squeeze), but it is illustrative of how easily a minimalist approach can be applied to language design in ways that comply with core elements of the language’s existing fundamentals. In fact the point of thinking this way is to explore and discover what those fundamentals might actually be as this is not always immediately obvious, even to someone experienced with using it or (ironically) the language’s original designer.

Let’s consider the syntax of Erlang’s “send” and “receive” operations.

Currently there is a magic token ! for “send” and another magic token, the word receive, for “receive”. Imagine if instead of these magical tokens we had the following functions as BIFs:

send(Recipient, Message) ->
    Recipient ! Message,
    ok.

receive() ->
    receive Any -> Any end.

receive(Timeout) ->
    receive
        Any -> {ok, Any}
    after
        Timeout -> timeout
    end.

Suddenly we realize that these could actually be BIFs, privileged with whatever magical implementation or compilation cheats we might come up with, and invoked very simply as:

ok = send(Recipient, Message),
Message = receive(),
case receive(10000) of
    {ok, Message} -> % …
    timeout       -> % …
end
It would be possible to implement these functions exactly as demonstrated and use them without any interruption or loss of meaning. This would reduce the syntactic complexity of the language, and in fact in the case of send/2 there actually is a function in the standard library (erlang:send/2) that has turned the magic ! syntax into an optional function for exactly this reason: because it is useful in a functional context to have functions with which to work!

Is there anything else we can (and in a better world, probably should) adapt in this way?

Why yes.
Yes, in fact there is: the notorious try which we somehow got along just fine without once upon a time…

try(Operation) ->
    try {ok, Operation()}
    catch C:E:S -> {C, E, S}
    end.
This could be written any number of other ways (in the old days this would have been speculative execution in a monitored process spawned specifically for this reason, for example, which you can still do and parallel map implementations actually do), and of course we would want to make it into a magic BIF almost certainly, but my point here is that the complexity can be pushed back into the compiler and system rather than remaining a point of syntactic complexity that has its own little magic (and totally unique) notation that even graybeards occasionally forget the details of and have to look up on a cheatsheet from time to time (“wait, can we still call a function to get a stacktrace or is that in the… ohhhh yeah, it’s a syntax thing now… [typing noises]”).

Protip for language designers: the more things a programmer has to remember about your language’s little quirks the less mental bandwidth they have remaining for whatever problem they are actually trying to solve.

Why even have such inconsistency in the language when other features of the language that take advantage of the existing syntax so well could simply have been implemented instead (and often retrospectively actually are, such as with send/2)?

In a word: habit

This is a bad habit. Languages in general, not just Erlang, would benefit greatly from a reduction in magic syntax and glyphy operators and we should move to minimize languages rather than add every allegedly “good idea” that the Fairy of Warped Ego poops out on our shoulders.

Try it out…

It’s really not all that insane.
C’mon, man, everyone’s doing it…



send(Recipient, Message) ->
    Recipient ! Message.

recv() ->
    receive Any -> Any end.

recv(Timeout) ->
    receive
        Any           -> {ok, Any}
    after Timeout     -> timeout
    end.

tryy(Operation) ->
    try {ok, Operation()}
    catch C:E:S -> {C, E, S}
    end.
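For illustration, here is how those helpers might feel in practice, assuming they are exported from a module named shortcuts (a made-up name):

    %% Hypothetical session, assuming the functions above are exported
    %% from a module named shortcuts (a made-up name):
    Parent = self(),
    spawn(fun() -> shortcuts:send(Parent, hello) end),
    {ok, hello} = shortcuts:recv(1000),  %% message arrives within the timeout
    timeout     = shortcuts:recv(0).     %% empty mailbox: times out immediately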


Type system updates: moving from research into development

A year ago, at ElixirConf EU 2022, we announced an effort to research and develop a type system for Elixir (video presentation) (written report).

This work is happening under the lead of Giuseppe Castagna, CNRS Senior Researcher, and is carried out by Guillaume Duboc as part of his PhD studies, with further guidance from myself (José Valim).

This article is a summary of where we are in our efforts and where we are going.

Out of research

Our main goal during research was to find a type system that can model most of Elixir’s functional semantics and to develop brand new theory in the areas we found to be incompatible or lacking. We believe we achieved this goal with a gradual set-theoretic type system and we are now ready to head towards development. Over the last 2 months, we have published plenty of resources on our results:

Our focus so far has been on the semantics. While we have introduced a new syntax capable of expressing the semantics of the new set-theoretic type system, the syntax is not final as there are still no concrete plans for user-facing changes to the language. Once we are confident those changes will happen, we will have plenty of discussion with the community about the type system interface and its syntax.

The work so far has been made possible thanks to a partnership between the CNRS and Remote, with sponsorships from Fresha, Supabase, and Dashbit.

Into development

While there is still ongoing research, our focus from the second semester of 2023 onwards is on development.

Incorporating a type system into a language used at scale can be a daunting task. Our concerns range from how the community will interact and use the type system to how it will perform on large codebases. Therefore, our plan is to gradually introduce our gradual (pun intended) type system into the Elixir compiler.

In the first release, types will be used just internally by the compiler. The type system will extract type information from patterns and guards to find the most obvious mistakes, such as typos in field names or type mismatches from attempting to add an integer to a string, without introducing any user-facing changes to the language. At this stage, our main goal is to assess the performance impact of the type system and the quality of the reports we can generate in case of typing violations. If we are unhappy with the results, we still have time to reassess our work or drop the initiative altogether.

The second milestone is to introduce type annotations only in structs, which are named and statically defined in Elixir codebases. Elixir programs frequently pattern match on structs, which reveals information about the struct fields, but the compiler knows nothing about their respective types. By propagating types from structs and their fields throughout the program, we will increase the type system’s ability to find errors while further straining our type system implementation.

The third milestone is to introduce the (most likely) $-prefixed type annotations for functions, with no or very limited type reconstruction: users can annotate their code with types, but any untyped parameter will be assumed to be of the dynamic() type. If successful, then we will effectively have introduced a type system into the language.

This new exciting development stage is sponsored by Fresha (they are hiring!), Starfish* (they are hiring!), and Dashbit.


Embrace Complexity; Tighten Your Feedback Loops



This post contains a transcript of the talk I wrote for and gave at QCon New York 2023 for Vanessa Huerta Granda's track on resilience engineering.

The official talk title was "Embrace Complexity; Tighten Your Feedback Loops". That’s the descriptive title for the talk that follows the conference’s guidelines about good descriptive titles. Instead I decided to follow my gut feeling and go with what I think really explains my perspective and the approach I bring with me to work and even my life in general:

I take what would probably be a sardonic approach to dealing with life and systems, and so “This is all going to hell anyway” is pervasive to my approach. Things are going to be challenging. There are going to always be pressures that keep pushing our systems to the edge of chaos. I don’t think this can be fixed or avoided. Any improvement will be used to bring it right to that edge. In complex systems, the richness and variability is often there for a reason. Trying to stamp it out in favour of stronger control is likely to create weird issues.

So the best I personally hope for is to have some limited influence in steering things the best I can to delay going to hell as long as possible, but that’s it. And my talk is going to focus on a lot of these approaches, but first, I want to explain why I feel things are that way.

In what is probably my favorite paper ever, titled Moving Off The Map, Ruthanne Huising ran ethnographic studies by embedding herself into projects within many large corporations doing planned organizational changes. In supporting these efforts, the project teams were doing “tracing” of their functions, which meant gathering a lot of data: what activities take place, what interactions and hand-offs exist, what information and tools are used and required, how long tasks take, and how people and teams deal with errors. Generally they were asking the question “what do we do here?” and wondering with whom they do it.

To build these maps they generally reached out to experts within the organization who were supposed to know how things were working. Even then, they were really surprised.

One explained that “it was like the sun rose for the first time… I saw the bigger picture.” Participants had never seen the pieces (jobs, technologies, tools, and routines) connected in one place, and they realized that their prior view was narrow and fractured, despite being considered experts.

Others would state that “the problem is that it was not designed in the first place.” The system was not designed nor coordinated, but generally showed the result of various parts of the organization making their own decisions, solving local problems, and adapting in a decentralized manner.

The last quote comes from events when a manager at one of the organizations walked the CEO through the map, highlighting the lack of design and the disconnect between strategy and operations. The CEO sat down, put his head on the table, and said, “This is even more fucked up than I imagined.” He realized that the operation of his organization was out of his control, and that his grasp on it was imaginary.

One of the most surprising results reported in there was about tracking the people who participated in organizing and running the change projects, and seeing who got promoted, who left, and who moved around the org or industry they were in.

She found out there were two main types of outcome. The first group turned out to be filled with people who got promotions. They were mostly folks who worked in communications, training, who managed the costs and savings of the projects, or those who helped do process design. Follow-up interviews revealed that most of them attributed their promotions to having worked on a big project to put under their belt, and to frequently working with higher-ups, which both helped with getting promoted.

Another group however mostly contained people who moved to the periphery: away from core roles at the organization, sometimes becoming consultants, or leaving altogether. Those who fit this category happened to be the people who collected the data and created the map. They attributed their moves to either feeling like they finally understood the organization better, felt more empowered to change things, or became so alienated by the results they wanted to get out.

So the question of course became how come people who feel they understand how the organization truly works and who want to change it move away from the central roles and positions, and into the peripheral ones?

The fatal insight, according to Huising, is something sociologists have known for a good while: the culture and the order imposed on organizations, groups, and even societies are often emergent and negotiated. And while it's obvious that these structures dictate a lot of actions, the actions themselves can preserve or change the structures around them.

The feelings of empowerment and alienation come in no small part from people realizing that they could change a lot more than they had thought, albeit often from outside the core decision-making that enforces the structure (while understanding how that core works), or from discovering that the ways they thought they were impacting things were not actually effective, leaving them feeling disembedded.

Another thing you have possibly experienced, though it isn’t in the paper, is the difference between the nominal structure of the org and the actual, emergent one that depends on power dynamics: who knows what or whom, who likes or dislikes each other, and so on.

If you've ever worked in a flat organization, like the one in the middle here, you know that even though you have little management structure to speak of, power dynamics and decision-making authority still exist. People who have no power attached to their role are still going to be consulted or inserted into the decision-making flow of the organization; they're still going to be influential and have the ability to make or break projects, just with less obvious accountability.

The nominal structure is the one where each level of management and within the organizational ladder specifies how information flows, and how authority is applied. It's what we see on the left in a more traditional org structure, and this way of organizing groups will simultaneously be useful to align efforts and to constrain them. It makes accountability more explicit and transparent, but structurally will prevent people from doing unspecified things, whether they would be harmful or useful.

The emergent structure is always there as well. It is implicit, always changing, and not necessarily constrained to your own organization either. Sometimes, the people who know how to run, maintain, or operate components, or whom people listen to, are not even in your org anymore. They might have moved away (to a different team or even a competitor), retired, or never been part of it at all: they just published a really influential piece of media and people look up to them.

But who knows what, works with whom, and who can move things around in specific contexts can be key to successful initiatives. Even if the organizational structure has often been put in place to constrain change, as a barrier to people working in mis-aligned ways, some folks central to the emergent structure, in key contexts, have earned enough trust to be allowed tacitly to bend and break the rules. They can choose not to enforce the rules, or the rules are not enforced as tightly for them with the hopes of positive outcomes—even if sometimes it can get you the opposite result.

I’m not here to argue in favor of one or the other structure, but mostly that in my experience, driving change or making initiatives succeeds the most when catering to both structures at once, or rather fails when only looking at one and being blocked by the other. They're both real, both distinct, and pretending only either exists is bound to cause you grief.

As a continuation of this, the way people work every day is often different from the way people around them imagine their work is being done. The gap between how work is thought to be done and how it is actually done is a major but generally invisible factor in how systems work out.

Based on flawed mental models of the work, procedures and prescriptions are given about how to do work, and will vary in inaccuracy. People will imagine things like, for example, writing all the tests before writing or modifying any code and that code coverage could be ideal and then that it will all be reviewed in depth by an expert, and will enshrine this as a policy.

But the application of these policies is never perfect. Sometimes code doesn't have an owner, or due to crunch time and based on how much the reviewer and author trust each other, the review won't be as in-depth as expected.

When you see this mismatch causing people to ignore or bend rules, you can choose to apply authority and ask for a stricter rule-following. This pattern of enforcing the rules harder will likely drive these adaptations underground rather than stamping them out, because real constraints drive that behavior.

In turn, the work as disclosed will be less adequate, and the work as imagined progressively gets worse and worse.

This becomes a feedback loop of misunderstanding and at some point, like our devastated CEO, you’re not managing the real world anymore.

To demonstrate this, earlier this year I went to my local Mastodon network—so you know this is super scientific—and ran a poll about time sheets. The question was "If you're a software developer who ever worked for an employer who had you track your time hourly into specific projects/customer accounts and you were short on time budget, did you..."

Multiple answers were accepted. Fewer than 15% of people either stopped work, worked without tracking their time anymore (for free), or shifted their time into other projects with more buffer space.

Roughly a third of people reported billing anyway, some stating that it's not their problem the time allocation wasn't realistic or adequate.

But the vast majority of answers, nearly 60%, came from people saying "my time tracking was always fake and lies," with some people stating they even wrote applications to generate realistic-looking time sheets.

What we can see here is an example of how work-as-imagined gets translated into policies ("people do their work in projects, and account for their time"), which at some point don't get applied right anymore. If I had to guess, it could be things like not being allowed to go over the time budget, or just finding the practice useless. But the end result is that the time sheet data just isn't trustworthy, and then it gets used again and again in further decision making.

The gap widens, and our CEO might also get to think "this is all fucked up."

Part of the reason for this is that every day decisions are made by trying to deal with all sorts of pressures coming from the workplace, which includes the values communicated both as spoken and as acted out. People generally want to do a good job and they’ll try to balance these conflicting values and pressures as well as they can.

The outcome of that trade-off being a success or a failure isn’t known ahead of time, but these small decisions accumulate based on the feedback we get from each of these and can end up compounding and accumulating, either as improvements, or as erosion that makes organizations more brittle, or really anywhere in between. People adopt the organization’s constraints as their own, and this set of pressures is the kind of stuff that drives processes to the edge of chaos over and over again.

These accumulations of small decisions, these continuous negotiations: that’s one way your culture can define itself. Small common everyday acts and the small amounts of social pressure you can apply locally have an impact, as minor as it might be, and it compounds. You can easily foster your own local counterculture within a team if you want to. This can be good (say in skunkworks projects where you bypass a structure to do important work) or bad (normalizing behaviors that are counterproductive and can create conflict).

So while a lot of the work you can do to improve reliability or resilience as a whole can be driven locally, my experience is that you nevertheless get the best results by also aligning with or re-aligning some of the organizational pressures and values usually set from above.

The idea here is to start looking at the organization from both ends: how can we support the people dealing with the trade-offs in conflicting goals as they happen, how can we influence the higher-level values and pressures such that we can try to reduce how often these conflicts happen even though they will definitely keep happening, and how can we better carry context and feedback across both ends so that we constantly adjust as best as we can. A system perspective on interactions, rather than focusing on components is also something I've found useful. The rest of the talk is going to be spent on these ideas.

(as a note, the third drawing is Dimethylmercury, a highly volatile, reactive, flammable, and colorless liquid. It's one of the strongest known neurotoxins, and less than 0.1 mL is enough to kill you through your skin, and gloves apparently do a bad job at protecting you)

So let's start with negotiating trade-offs, with a bit more of an ops-y perspective, because that's where I'm coming from.

This is a painful one sometimes, especially when you have highly professional people who take their jobs seriously.

Locally for you as a DevOps or SRE team, there is a need for the awareness of what the organization and customers actually care about. Some availability targets become useless metrics because they’re disconnected from what users want, and you’re just going to burn people out doing it.

I learned this lesson when talking to the SRE manager of one of these websites where people pick their favorite images, put them on boards, and get shown ads. He was telling me how their site was having a lot of reliability issues. It would keep going down, his team would do heroics to bring it back up, and it'd fall over all over again.

He felt his team was burning out. They were losing people, and their call rotation was so painful they were also having issues hiring back into it. He was seeing the death spiral happening and was wondering what to do.

He added that there were perverse incentives at play: every time the site went down, they stopped showing images, but not ads. That meant that during incidents, they still earned money, but no longer paid for bandwidth. The site was more profitable when it failed than when it worked, and seemingly, users didn't mind much.

They were not getting help, nobody seemed to consider it a problem. Not really knowing what to say, I just asked off-hand: "are you trying to deliver more reliability than people are asking for? What if you just stopped and let it burn more and rested your people?" He thought about it seriously, and said "yeah, maybe."

I never actually found out what happened after this, but it still stuck with me as a really good question to ask from time to time.

In some cases, the answer will be "yes, we want to be this reliable". But you just won't be given the right tools to do it.

At Honeycomb, we want on-call rotations to have 5-8 people on them because that’s what we think gives a good pace that maintains a balance between how rested and how out-of-practice people can be. Not too often, but not too rarely either.

But many services are owned by smaller teams of 3-4 people. If we wanted rotations to be made of people who know all their components in depth, where they could build expertise and operate what they wrote, we couldn't reach a sustainable frequency.

Instead, to keep the pace right, we tend to put together rotations made of multiple teams, in which people won’t understand many of the components they operate. This in turn makes us prepare to deal with more unknowns: fewer runbooks, more high-level switches and manual circuit breakers to gracefully degrade parts of the system to keep it running off-hours, and different patterns of escalation.

We started leaning more heavily on this when a big public product launch required shipping a new feature, which was to be operated by a team that didn't have full time to get it operationally ready. When our SRE team was discussing with them what still needed to be done, we asked for a few simple things: a way to switch the feature off for a single customer, and a way to turn it off entirely, that wouldn't break the rest of the product. The rest we could add as we went.

We ended up using these switches a few times, one of which prevented a surprising write-amplification bug that could have killed the whole system, and instead let us wait a few hours for the code owners to get up and fix it at a leisurely pace. We're going to accept a bit of well-scoped, partial unavailability—something that happens a lot in large distributed systems—in order to keep the system stable.
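A minimal sketch of the kind of per-customer kill switch described above, assuming flags are kept in the application environment; the module and flag names are made up, and this is not Honeycomb's actual implementation:

    %% Hypothetical per-customer kill switch; myapp and the flag shape
    %% are made-up names, not a real system's API.
    feature_enabled(Feature, CustomerId) ->
        case application:get_env(myapp, {disabled, Feature}) of
            {ok, all}                   -> false;  %% switched off entirely
            {ok, Ids} when is_list(Ids) -> not lists:member(CustomerId, Ids);
            undefined                   -> true    %% no flag set: feature stays on
        end.

Turning the feature off for one customer would then be setting the flag to a list of customer ids, and turning it off entirely would be setting it to the atom all.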

The person wearing the pager often does triage, knowing that weird issues will eventually be handled by code owners, just not right now.

This approach means that rather than working impossible hours and making inhuman efforts foreseeing the unforeseeable, we keep moving rather fast, gather feedback, find issues, and turn around a bit more on a dime. In order to do this though, there’s a general understanding that production issues may turn parts of the roadmap upside down, that escalations outside of the call rotation can disrupt project work, and so on.

That’s one of the complex trade-offs we can make between staffing, training/onboarding, capacity planning, iterative development, testing approaches, operations, roadmap, and feature delivery. And you know, for some parts of our infra we make different decisions because the consequences and mechanisms differ.

To make these tricky decisions, you have to be able to bring up these constraints, these challenges, and have them be discussed openly without a repression that forces them underground.

One of my favorite examples is from a prior job, where one of my first mandates was to try and help with their reliability story. We went over 30 or so incident reports that had been written over the previous year, and a pattern that quickly came up was how many reports mentioned "lack of tests" (or lack of good tests) as causes, and had "adding tests" in action items.

By looking at the overall list, our initial diagnosis was that testing practices were challenging. We thought of improving the ergonomics around tests (making them faster) and to also provide training in better ways to test. But then we had another incident where the review reported tests as an issue, so I decided to jump in.

I reached out to the engineers in question and asked about what made them feel like they had enough tests. I said that we often write tests up until the point we feel they're not adding much anymore, and that I was wondering what they were looking at, what made them feel like they had reached the points where they had enough tests. They just told me directly that they knew they didn't have enough tests. In fact, they knew that the code was buggy. But they felt in general that it was safer to be on-time with a broken project than late with a working one. They were afraid that being late would put them in trouble and have someone yell at them for not doing a good job.

When I went up to upper management, they absolutely believed that engineers were empowered and should feel safe pressing a big red button that stopped feature work if they thought their code wasn't ready. The engineers on that team felt that while this is what they were being told, in practice they'd still get in trouble.

There's no amount of test training that would fix this sort of issue. The engineers knew they didn't have enough tests and they were making that tradeoff willingly.

(note: this slide was cut from the presentation since I was short on time)

Speaking of which, sometimes it’s also fine to drop reliability because there are bigger systemic threats.

Sometimes you can eat downtime or degraded service because it’s going to keep your workload manageable and keep people from burning out. Or maybe you take a hit because a big customer that makes you hit your targets as an org (and can prevent layoffs) will push some things over the limit and a component’s performance will suffer. You can’t be the department of “no”, and that negotiation has to be done across departments.

Conversely however, you have to be able to call out when your teams are strained, when targets aren’t being met and customers are complaining about it. It means you might be right, and some deadlines or feature delivery could be deferred to make room for others.

How do you deal with capacity planning when making your biggest customer renew their contract prevents you from signing up another one that’s as big? Very carefully, by talking it out among all the people involved.

And sometimes that trade-off is very reasonable. Good engineering requires you to move it earlier in the lifecycle of software than just around incidents. It’s sometimes much simpler to change the shape of a product’s features than it is to deliver a perfect distributed system. Making your features take the ideal shape to deal with the reality of physics is one of the things a good collaborative approach can facilitate.

So we can make tradeoff negotiation simpler by having these honest discussions, but in many cases this ability to discuss constraints to influence how work takes place brings us to this next step, where we don’t only influence the decisions people make, but surface these challenges to influence how the organization applies its pressures. This is moving from the local level to the alignment to the broader org structure.

Metrics are good to direct your attention and confirm hypotheses, but not as a target, and they’re unlikely to be good for insights. They’re compression, and it can be unreliable.

The thing you generally care about is your customer or user's satisfaction, but there's a limit to how many times you can ask "would you recommend us to a friend?" and still get a good signal. So you start picking a surrogate variable.

You assume that when the site is down or slow, people are mad, and you make being up and fast a proxy for satisfaction. But then that signal is a bit messy and not super actionable, because it can include user devices or bits of the network you don't control, plus it's hard to measure, so you'll settle for response time at the edge of your infrastructure. This loses fidelity in the signal, and it gets worse as you suddenly find some teams have more data than others and use features differently, so you either need a ton of alarms or fewer, messier ones, and you're getting further and further away from whether people are actually satisfied.

This loss of context is a critical part of dealing with systems that are too complex to adequately be represented by a single aggregate. Whenever a signal is useful, an in-depth dive is usually worth it if you are looking to embrace complexity.

The metric is better used to attract your attention than as a target or as something that tells you what to know. Seek to explain and understand the metric first, not to change it.

As a related concept, if you act on a leading indicator, it stops leading, particularly when it’s influenced by trade-offs.

Metrics that become their own targets and are gamed of course lose meaningfulness; this is one of the most common issues with counting incidents and then debating whether an outage should or shouldn’t be declared in a way that might affect the tally rather than addressing it directly.

But other metrics are of interest as well. If you evaluate your total capacity by some bottleneck’s value, and that bottleneck becomes a target of optimization work, you will lose the ability to easily know when or how to scale up, because the bottleneck possibly hid something else. I believe this contributes to a non-negligible portion of our incidents at work: we fix a thing that acted as an implicit blocker, and off we go into the great unknown.

Our storage engine's disk storage used to be our main bottleneck. We drove scaling out and rebalancing traffic based on how close we were to heavy usage across multiple partitions. This was a useful signal, but it also drove costs up, and eventually became the target of optimization.

An engineer successfully made our data offloading almost an order of magnitude faster, and eliminated our most glaring scaling issues at the time. Removing this limit however messed with our ability to know when to scale, which then revealed issues with file descriptors, memory, and snapshotting times.

The only good advice I have here is to re-evaluate your metrics often, and change them. I guess there’s also a lesson to be learned that improvements can also cause their own uncertainty and that these successes can themselves lead to destabilizations.

Because we no longer needed to scale out as aggressively, we were free to discover new issues, and one of our best improvements to the system in recent memory is therefore also a contributor to a lot of operational challenges.

Things that people think are useful are possibly going to happen even if you forbid them. If you forbid people from logging onto production hosts, and they truly think they'll need it for emergency situations, they'll make sure there's still a way for it to happen, albeit under a different name.

On the other hand, things that people think are useless are likely to be done in a minimal way with no enthusiasm, such as lying in your timesheets.

This means that writing a procedure means little unless people actually see its value and believe it’s worth following. Conversely, it means that if you can demonstrate the usefulness and make some approaches more usable, they’re likely to get adopted regardless of what is written down as a list of steps or procedures.

A related concept here is that if you are tracking things like action items after incident reviews and they go in the backlog to die, it may not be that your people are failing to follow through; it might be that it’s impractical to do so, or that these action items never felt useful, and the process itself needs to be revisited rather than reinforced.

Seeing non-compliance is not necessarily a sign of bad workers. It may rather be a sign of a bad understanding of the workers' challenges, and point to a need to adjust how work is prescribed.

Getting a small real buy-in into something voluntary may be better than getting fake buy-in into something you’re forcing people to do. Of course, if you manage to write a good procedure that people believe is worth following, more power to you; this is going great.

The shortest feedback loop may be attained by giving people the tools to make the right decisions right there and then, and let them do it. Cut the middlemen, including yourself.

How do you make that work? We come back to goal alignments and top priorities being harmonized and well understood. If the pressures and goals are understood better, the decisions made also work better.

That does mean that you have to listen back about how these things have been going, and that not only do you need to trust your people, but they need to trust you back with critical and unpleasant information as well. The feedback flows both ways, and this hinges on psychological safety.

If you've ever talked to a contractor asked to help a big organization, the first thing they'll tell you they do is go talk to the workers with boots on the ground and ask them what they think needs changing. The workers will often have years of potential improvements backlogged that they're ready to tell anyone about, either because management wouldn't listen, or because the workers lost trust that voicing that feedback would yield any result.

Then the contractor brings it up to management as a neutral party, and suddenly it gets listened to and acted upon.

If you've lost that trust, then contractors can play that specific role of workers at the periphery of the organization helping drive change, and they can play a very useful function.

But if you have that trust already, maintaining it is crucial because that’s how you get all the good information to help orient and influence things.

Trust also means that if you want people to be innovative, you have to allow them to make mistakes. You can’t get it right the first time all the time; if people can’t be allowed to get it wrong here and there, they won’t be allowed to improve and try new things either.

Finally, let's look at shifting perspective away from a bare analysis and onto a more systemic point of view. People in specific teams often have a more detailed expert view than you could ever have, but if you're standing outside of it, your strength might be to understand how the parts interact in a way that isn't visible from the inside.

The most basic point here is that you can’t expect to change the outcome of these small little decisions that accumulate all the time if you never address the pressures within the system that foster them.

I used to weed my lawn a whole hell of a lot, pulling weeds for hours a week, until someone explained to me that weeds grew more easily than grass in the type of soil I had (poor, dry, unmaintained soil). Pulling the weeds wasn’t the way to go; I needed to actually make the soil good for the grass so it could crowd out the weeds.

It's similar when considering this whole idea of root cause analysis—of trying to find the one source of the problem and removing it. If your root cause is at the weed’s level, you’ll keep pulling on them forever and will rarely make decent progress. The weeds will keep growing no matter how many roots you remove.

If you foster good soil, if you create the right environment that encourages the type of behavior you want instead of the type of behaviour you dislike, you have hopes that the good stuff will crowd out the bad stuff. That’s a roundabout way of talking about culture change. And for these, deep dives based on richer narratives and thematic analysis prove more useful.

Also there's a warning here about trying to change the decisions your people make with carrots and sticks—with incentives. They are not going to fundamentally change what pressures the employees negotiate. The pressures stay the same, all you're doing is adding more of them, either in the form of rewards or punishments, which makes decision-making more complex and trickier.

Chances are people will keep making the same decisions as they were already, but then they'll report it differently to either get their bonus or to avoid getting penalized for it. Surfacing, understanding, and clarifying goal conflicts can make things easier or shape work to give them more room. Adding carrots and sticks can make things harder.

But the tip here is probably: look into what are the behaviors you want to see happen, and give them room to grow.

My most successful initiative at Honeycomb is probably creating weekly discussion sessions about operational stuff and on-call. They range from “how do we operate new service X” into trickier discussions like “is it okay to be visibly angry in an incident”, “how do you deal with shit you don’t know or avoid burnout” or “are there times where code freezes are actually a useful thing?”.

Over time we looked into all sorts of weird interactions and the meeting became its own tool.

When we noticed incident reviews were difficult to schedule across departments and timezones, we decided that a good wide incident review is good operational talk and started making the optional time slot, which was already on every engineer's calendar (and some other departments too), available for them. It became easier for people to run incident reviews, and over time their size grew from 7-8 people, scoped to 1 or 2 teams, to bigger events with 20 to 40 people in them.

We removed a huge but subtle blocker to good feedback loops existing within the organization.

These sorts of small changes are those you can drive locally with almost no risk of having them run afoul of organizational priorities, and when you see them work, use the org structure to expand them everywhere.

I find it useful to keep focusing on what an indicator triggers as a behavior (the interaction) rather than only what it reports directly. This slide here is 4 error budgets from our SLOs, which combine how successful requests are both in terms of speed and errors, compared to an objective we express in terms of the desired fault rate.

When we have to pick targets for our platform, people often ask whether we could pick some key SLOs and turn them into the objective. My answer is almost always "I don't care if we meet the SLOs or not". I mean I care, but not like that.

SLOs aren’t hard and fast rules. When the error budget is empty, the main thing that matters to me is that we have a conversation about it, and decide what it is we want to happen from there on. Are we going to hold off on deploys and experiments? Are we able to meet the objectives while on-call, with some scheduled corrective work, or some major re-architecting? Can we just talk to the customers? Were our targets too ambitious, or are we going to eat dirt for a while?

Kneejerk automated reactions aren’t nearly as useful as sitting down and having a cross-departmental discussion about what it is we want to do, as an organization, about these signals of unmet expectations. If it fits within on-call duty, as is probably the case with the error budget on the top left, then fine.

But in other cases, such as the top right budget here, which seems to show a gradual decline, we have to choose whether to do corrective work (and how/when) to meet the SLO—because that wasn't expected and is undesirable—or maybe to relax it—because that's actually a natural consequence of new, more expensive features and we need to tweak definitions. Or we could temporarily ignore it because corrective work is already on the way, but not a top priority right now.

The two budgets at the bottom come from SLOs that may never page anyone. But from time to time, we re-calibrate them by asking support whether there are any issues users complain about that we aren't already aware of. So long as we're ahead of the complaints, we figure the SLOs are properly defined. But from time to time, we find out that we slipped by getting comments on things our alerting never properly captured. Or maybe we needed to better manage the user's expectations—that's also an option.

For any of these choices, we also have to know how this is going to be communicated to users and customers, and having these discussions is the true value of SLOs to me. SLOs that flow outside of engineering teams provide a greater feedback loop about our practices, further upstream, than those that are used exclusively by the teams defining them, regardless of their use for alerting.

Finally, this is where SREs are placed in a great position to shine. You can be away from the central roles, away from the decision-making, on the periphery. By being outside of silos and floating around the organization’s structure, you are allowed to take information from many levels, carry it around, and really tie the loop at the end of so many decisions made in the organization by noting and carrying their impact back once they’ve hit a production system.

It is an iterative exercise; our sociotechnical systems are alive, and by carrying pertinent signals and amplifying them, you can influence how long it’s gonna take before it all goes to hell anyway.


Elixir v1.15 released

Elixir v1.15 has just been released. 🎉

Elixir v1.15 is a smaller release with focused improvements on compilation and boot times. This release also completes our integration process with Erlang/OTP logger, bringing new features such as log rotation and compression out of the box.

You will also find additional convenience functions in Code, Map, Keyword, all Calendar modules, and others.

Finally, we are glad to welcome Jean Klingler as a member of the Elixir Core team. Thank you for your contributions!

Compile and boot-time improvements

The last several releases brought improvements to compilation time and this version is no different. In particular, Elixir now caches and prunes load paths before compilation, ensuring your project (and dependencies!) compile faster and in an environment closer to production.

In a nutshell, the Erlang VM loads modules from code paths. Each application that ships with Erlang and Elixir plus each dependency become an entry in your code path. The larger the code path, the more work Erlang has to do in order to find a module.

In previous versions, Mix would only add entries to the load paths. Therefore, if you compiled 20 dependencies and you went to compile the 21st, the code path would have 21 entries (plus all Erlang and Elixir apps). This allowed modules from unrelated dependencies to be seen and made compilation slower the more dependencies you had. With this release, we will now prune the code paths to only the ones listed as dependencies, bringing the behaviour closer to mix release.

Furthermore, Erlang/OTP 26 allows us to start applications concurrently and cache the code path lookups, decreasing the cost of booting applications. The combination of Elixir v1.15 and Erlang/OTP 26 should also reduce the boot time of applications, such as when starting iex -S mix or running a single test with mix test.

As an example, I have benchmarked the Livebook application on an M1 Max MacStudio across different Elixir and Erlang/OTP versions. At the time of benchmarking, Livebook had ~200 source .ex files and ~35 dependencies. Compilation times improved by 16%:

Livebook compilation times

Livebook saw an improvement of 30% on boot times:

Livebook boot times

Different applications will see different results. Our expectation is that the gains will be more meaningful the more dependencies you have, the more files you have, and the more cores you have. We have even received reports of up to 40% faster compilation times, although it is yet unclear how generalizable this will be in practice. Note this work does not improve the time to compile slow individual files.

The compiler is also smarter in several ways: @behaviour declarations no longer add compile-time dependencies and aliases in patterns and guards add no dependency whatsoever, as no dispatching happens. Furthermore, Mix now tracks the digests of @external_resource files, reducing the amount of recompilation when swapping branches. Finally, dependencies are automatically recompiled when their compile-time configuration changes, providing a smoother development experience.

Potential incompatibilities

Due to the code path pruning, if you have an application or dependency that does not specify its dependencies on Erlang/OTP and core Elixir applications, which has always been erroneous behaviour, it may no longer compile successfully in Elixir v1.15. You can temporarily disable code path pruning by setting prune_code_paths: false in your mix.exs, although doing so may lead to runtime bugs that are only manifested inside a mix release.
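As a sketch (the project name here is hypothetical), the option goes in the project/0 section of mix.exs:

```elixir
# mix.exs — illustrative sketch; :my_app is a hypothetical project name.
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :my_app,
      version: "0.1.0",
      elixir: "~> 1.15",
      # Temporarily restore the pre-v1.15 behaviour of keeping all
      # dependencies in the code path during compilation:
      prune_code_paths: false,
      deps: []
    ]
  end
end
```

Treat this as a stopgap while you fix the missing dependency declarations, since the pruned behaviour matches what mix release does in production.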

Compiler warnings and errors

The Elixir compiler can now emit many errors for a single file, making sure more feedback is reported to developers before compilation is aborted.

In Elixir v1.14, an undefined function would be reported as:

** (CompileError) undefined function foo/0 (there is no such import)

In Elixir v1.15, the new reports will look like:

error: undefined function foo/0 (there is no such import)

** (CompileError) my_file.exs: cannot compile file (errors have been logged)

A new function, called Code.with_diagnostics/2, has been added so this information can be leveraged by editors, allowing them to point to several errors at once. Work is ongoing, alongside community contributions, to further improve the compiler diagnostics in future Elixir releases.
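For illustration, here is a minimal sketch of how tooling might use Code.with_diagnostics/2 to collect errors while compiling a snippet with an undefined function:

```elixir
# Collect compiler diagnostics instead of letting the CompileError
# escape — the pattern editors and tooling can use (Elixir v1.15+).
{result, diagnostics} =
  Code.with_diagnostics(fn ->
    try do
      Code.compile_string("defmodule Sample do\n  def run, do: foo()\nend")
    rescue
      err -> {:error, err}
    end
  end)

# `result` holds the rescued {:error, %CompileError{}} tuple, while
# `diagnostics` is a list of maps describing each logged error.
```

Each diagnostic carries fields such as :severity, :message, :position, and :file, which is what allows an editor to surface several errors from a single compilation pass.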

Potential incompatibilities

As part of this effort, the behaviour where undefined variables were transformed into nullary function calls, often leading to confusing error reports, has been disabled during project compilation. You can invoke Code.compiler_options(on_undefined_variable: :warn) at the top of your mix.exs to bring the old behaviour back.
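Concretely, that call is a sketch of a single line placed at the very top of mix.exs, before the project module is defined:

```elixir
# mix.exs (sketch) — restore the old behaviour of treating undefined
# variables as nullary function calls, warning instead of erroring:
Code.compiler_options(on_undefined_variable: :warn)
```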

Integration with Erlang/OTP logger

This release provides additional features such as global logger metadata and file logging (with rotation and compression) out of the box!

This release also soft-deprecates Elixir’s Logger Backends in favor of Erlang’s Logger handlers. Elixir will automatically convert your :console backend configuration into the new configuration. Previously, you would set:

config :logger, :console,
  level: :error,
  format: "$time $message $metadata"

Which is now translated to the equivalent:

config :logger, :default_handler,
  level: :error

config :logger, :default_formatter,
  format: "$time $message $metadata"

To replace the default console handler with one that writes to disk, with log rotation and compression:

config :logger, :default_handler,
  config: [
    file: ~c"system.log",
    filesync_repeat_interval: 5000,
    file_check: 5000,
    max_no_bytes: 10_000_000,
    max_no_files: 5,
    compress_on_rotate: true
  ]

Finally, the previous Logger Backends API is now soft-deprecated. If you implement your own backends, you may want to consider migrating to :logger_backends in the long term. See the new Logger documentation for more information on the new features and compatibility.
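As a rough sketch of the direction (the module and handler id here are hypothetical), a custom backend can be replaced by an Erlang :logger handler implementing the log/2 callback:

```elixir
# Hypothetical sketch of an Erlang/OTP :logger handler written in
# Elixir, the long-term replacement for custom Logger backends.
defmodule MyApp.LogHandler do
  # Callback invoked by Erlang/OTP's :logger for each log event.
  def log(%{level: level, msg: msg}, _handler_config) do
    IO.puts("[#{level}] #{inspect(msg)}")
  end
end

# Register the handler under an id, with a configuration map:
:ok = :logger.add_handler(:my_handler, MyApp.LogHandler, %{})
```

Because handlers plug into Erlang's logger directly, they receive events from both Elixir's Logger and Erlang libraries, which is part of the motivation for the migration.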

Learn more

For a complete list of all changes, see the full release notes.

Check the Install section to get Elixir installed and read our Getting Started guide to learn more.

Happy compiling!


Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.