Erlang/OTP 28.0-rc1 is the first release candidate of three before the OTP 28.0 release.
The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you. We encourage users to try it out and give us feedback either by creating an issue at https://github.com/erlang/otp/issues or by posting to Erlang Forums.
All artifacts for the release can be downloaded from the Erlang/OTP Github release and you can view the new documentation at https://erlang.org/documentation/doc-16.0-rc1/doc. You can also install the latest release using kerl like this:
kerl build 28.0-rc1 28.0-rc1
Starting with this release, a source Software Bill of Materials (SBOM) will describe the release on the Github Releases page. We welcome feedback on the SBOM.
Erlang/OTP 28 is a new major release with new features and improvements, as well as a few incompatibilities. Some of the new features are highlighted below.
Many thanks to all contributors!
Comprehensions have been extended with “zip generators” allowing multiple generators to be run in parallel. For example, [A+B || A <- [1,2,3] && B <- [4,5,6]] will produce [5,7,9].
Generators in comprehensions can now be strict, meaning that if the generator pattern does not match, an exception will be raised instead of silently ignoring the value that didn’t match. Strict generators use the <:- and <:= operators; for example, [B || {ok, B} <:- Results] raises on any element of Results that is not an {ok, _} tuple.
It is now possible to use any base for floating point numbers as per EEP 75: Based Floating Point Literals. For example, 2#0.111 is the base-2 literal for 0.875.
For certain types of errors, the compiler can now suggest corrections. For example, when attempting to use variable A that is not defined but A0 is, the compiler could emit the following message: variable 'A' is unbound, did you mean 'A0'?
The size of an atom in the Erlang source code was limited to 255 bytes in previous releases, meaning that an atom containing only emojis could contain only 63 emojis. While atoms are still only allowed to contain 255 characters, the number of bytes is no longer limited.
The warn_deprecated_catch option enables warnings for use of old-style catch expressions of the form catch Expr instead of the modern try … catch … end.
Provided that the map argument for a maps:put/3 call is known to the compiler to be a map, the compiler will replace such calls with the corresponding update using the map syntax, rewriting maps:put(Key, Value, Map) into Map#{Key => Value}.
Some BIFs with side effects (such as binary_to_atom/1) are optimized in try … catch in the same way as guard BIFs in order to gain performance.
The compiler’s alias analysis pass is now both faster and less conservative, allowing optimizations of records and binary construction to be applied in more cases.
The trace:system/3 function has been added. It has an interface similar to erlang:system_monitor/2, but it also supports trace sessions.
os:set_signal/2 now supports setting handlers for the SIGWINCH, SIGCONT, and SIGINFO signals.
The two new BIFs erlang:processes_iterator/0 and erlang:process_next/1 make it possible to iterate over the process table in a way that scales better than erlang:processes/0.
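Called from Elixir, a minimal sketch of such an iteration might look as follows. The exact return shape of erlang:process_next/1 ({pid, iterator} until :none, as assumed here) should be verified against the OTP documentation:
defmodule ProcWalk do
  # Walk the process table one pid at a time, applying fun to each.
  # Assumes process_next/1 returns {pid, next_iterator} or :none.
  def each(fun) do
    walk(:erlang.processes_iterator(), fun)
  end

  defp walk(iter, fun) do
    case :erlang.process_next(iter) do
      :none ->
        :ok

      {pid, next_iter} ->
        fun.(pid)
        walk(next_iter, fun)
    end
  end
end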
The erl -noshell mode has been updated to have two sub-modes called raw and cooked, where cooked is the old default behaviour and raw can be used to bypass the line-editing support of the native terminal. Using raw mode it is possible to read keystrokes as they occur without the user having to press Enter. Also, the raw mode does not echo the typed characters to stdout.
The shell now prints a help message explaining how to interrupt a running command when it has been executing for longer than 5 seconds.
The join(Binaries, Separator) function, which joins a list of binaries, has been added to the binary module.
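Given that signature, usage (shown here from Elixir, which prints printable binaries as strings) looks like this:
iex> :binary.join([<<"red">>, <<"green">>, <<"blue">>], <<", ">>)
"red, green, blue"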
By default, sets created by the sets module will now be represented as maps.
The re module has been updated to use the newer PCRE2 library instead of the PCRE library.
There is a new zstd module that provides Zstandard compression.
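As a rough sketch of a round trip from Elixir (the one-shot compress/1 and decompress/1 function names are assumptions based on the announcement; the module also has more granular APIs, so consult the docs):
iex> data = :binary.copy(<<"hello ">>, 1000)
iex> compressed = :zstd.compress(data)
iex> IO.iodata_to_binary(:zstd.decompress(compressed)) == data
true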
The indent-region command in Emacs will now handle multiline strings better.
For more details about new features and potential incompatibilities see the README.
Welcome to our series of case studies about companies using Elixir in production.
Remote is the everywhere employment platform enabling companies to find, hire, manage, and pay people anywhere across the world.
Founded in 2019, they reached unicorn status in just over two years and have continued their rapid growth trajectory since.
Since day zero, Elixir has been their primary technology. Currently, their engineering organization as a whole consists of nearly 300 individuals.
This case study focuses on their experience using Elixir in a high-growth environment.
Marcelo Lebre, co-founder and president of Remote, had worked with many languages and frameworks throughout his career, often encountering the same trade-off: easy-to-code versus easy-to-scale.
In 2015, while searching for alternatives, he discovered Elixir. Intrigued, Marcelo decided to give it a try and immediately saw its potential. At the time, Elixir was still in its early days, but he noticed how fast the community was growing, with packages and frameworks appearing at a rapid pace.
In December 2018, when Marcelo and his co-founder decided to start the company, they had to make a decision about the technology that would support their vision. Marcelo wanted to prioritize building a great product quickly without worrying about scalability issues from the start. He found Elixir to be the perfect match:
I wanted to focus on building a great product fast and not really worry about its scalability. Elixir was the perfect match—reliable performance, easy-to-read syntax, strong community, and a learning curve that made it accessible to new hires.
- Marcelo Lebre, Co-founder and President
The biggest trade-off Marcelo identified was the smaller pool of Elixir developers compared to languages like Ruby or Python. However, he quickly realized that the quality of candidates more than made up for it:
The signal-to-noise ratio in the quality of Elixir candidates was much higher, which made the trade-off worthwhile.
- Marcelo Lebre, Co-founder and President
Remote operates primarily with a monolith, with Elixir on the backend and React on the frontend.
The monolith enabled speed and simplicity, allowing the team to iterate quickly and focus on building features. However, as the company grew, they needed to invest in tools and practices to manage almost 180 engineers working in the same codebase.
One practice was boundary enforcement. They used the Boundary library to maintain strict boundaries between modules and domains inside the codebase.
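For illustration (hypothetical module names, not Remote's actual code), a Boundary declaration lets each top-level module state which boundaries it depends on and what it exports; calls that cross boundaries outside those declarations produce compilation warnings:
defmodule MyApp.Payroll do
  # This boundary may call into MyApp.Accounts, and only Payslip is
  # visible to other boundaries; anything else is flagged at compile time.
  use Boundary, deps: [MyApp.Accounts], exports: [Payslip]
end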
Another key investment was optimizing their compilation time in the CI pipeline. Since their project has around 15,000 files, compiling it in every build would take too long. So, they implemented incremental builds in their CI pipeline, recompiling only the files affected by changes instead of the entire codebase.
I feel confident making significant changes in the codebase. The combination of using a functional language and our robust test suite allows us to keep moving forward without too much worry.
- André Albuquerque, Staff Engineer
Additionally, as their codebase grew, the Elixir language continued to evolve, introducing better abstractions for developers working with large codebases. For example, with the release of Elixir v1.11, the introduction of config/runtime.exs provided the Remote team with a better foundation for managing configuration. This enabled them to move many configurations from compile-time to runtime, significantly reducing unnecessary recompilations caused by configuration updates.
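As an illustration (hypothetical application and environment variable names), a value read in config/runtime.exs is resolved when the system boots rather than when it compiles:
# config/runtime.exs — evaluated at boot, so changing this configuration
# does not trigger recompilation of the modules that use it.
import Config

config :my_app, MyApp.Mailer,
  api_key: System.fetch_env!("MAILER_API_KEY")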
One might expect Remote’s infrastructure to be highly complex, given their global scale and the size of their engineering team. Surprisingly, their setup remains relatively simple, reflecting a thoughtful balance between scalability and operational efficiency.
Remote runs on AWS, using EKS (Elastic Kubernetes Service). The main backend (the monolith) operates in only five pods, each with 10 GB of memory. They use Distributed Erlang to connect the nodes in their cluster, enabling seamless communication between processes running on different pods.
For job processing, they rely on Oban, which runs alongside the monolith in the same pods.
Remote also offers a public API for partners. While this API server runs separately from the monolith, it is the same application, configured to start a different part of its supervision tree. The separation was deliberate, as the team anticipated different load patterns for the API and wanted the flexibility to scale it independently.
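A minimal sketch of that pattern (hypothetical names, not Remote's actual code) is to pick the supervision children at boot based on a runtime role setting:
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    # :role would be set in config/runtime.exs, letting the same codebase
    # boot either as the monolith or as the public API server.
    children =
      case Application.fetch_env!(:my_app, :role) do
        :api -> [MyAppWeb.ApiEndpoint]
        :monolith -> [MyApp.Repo, MyAppWeb.Endpoint]
      end

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end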
The database setup includes a primary PostgreSQL instance on AWS RDS, complemented by a read-replica for enhanced performance and scalability. Additionally, a separate Aurora PostgreSQL instance is dedicated to storing Oban jobs. Over time, the team has leveraged tools like PG Analyze to optimize performance, addressing bottlenecks such as long queries and missing indexes.
This streamlined setup has proven resilient, even during unexpected spikes in workload. The team shared an episode where a worker’s job count unexpectedly grew by two orders of magnitude. Remarkably, the system handled the increase seamlessly, continuing to run as usual without requiring any design changes or manual intervention.
We once noticed two weeks later that a worker’s load had skyrocketed. But the scheduler worked fine, and everything kept running smoothly. That was fun.
- Alex Naser, Staff Engineer
Around 90% of their backend team works in the monolith, while the rest work in a few satellite services, also written in Elixir.
Within the monolith, teams are organized around domains such as onboarding, payroll, and billing. Each team owns one or multiple domains.
To streamline accountability in a huge monolith architecture, Remote invested heavily in team assignment mechanisms.
They implemented a tagging system that assigns ownership down to the function level. This means any trace—whether sent to tools like Sentry or Datadog—carries a tag identifying the responsible team. This tagging also extends to endpoints, allowing teams to monitor their areas effectively and even set up dashboards for alerts, such as query times specific to their domain.
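The article does not show the implementation, but a minimal version of function-level ownership tagging (all names hypothetical) could attach the team to logger metadata, which tools such as Sentry and Datadog pick up through their logger integrations:
defmodule MyApp.Billing.Invoices do
  require Logger

  @owner_team "billing"

  def generate(customer_id) do
    # Everything logged by this process from here on carries the team tag.
    Logger.metadata(team: @owner_team)
    Logger.info("generating invoice for #{customer_id}")
    # ... actual work ...
  end
end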
The tagging system also simplifies CI workflows. When a test breaks, it’s automatically linked to the responsible team based on the Git commit. This ensures fast issue identification and resolution, removing the need for manual triaging.
Remote’s hiring approach prioritizes senior engineers, regardless of their experience with Elixir.
During the hiring process, all candidates are required to complete a coding exercise in Elixir. For those unfamiliar with the language, a tailored version of the exercise is provided, designed to introduce them to Elixir while reflecting the challenges they would face if hired.
Once hired, new engineers are assigned an engineering buddy to guide them through the onboarding process.
For hires without prior Elixir experience, Remote developed an internal Elixir training camp, a curated collection of best practices, tutorials, and other resources to introduce new hires to the language and ecosystem. This training typically spans two to four weeks.
After completing the training, engineers are assigned their first tasks—carefully selected tickets designed to build confidence and familiarity with the codebase.
Remote’s journey highlights how thoughtful technology, infrastructure, and team organization decisions can support rapid growth.
By leveraging Elixir’s strengths, they built a monolithic architecture that balanced simplicity with scalability. This approach allowed their engineers to iterate quickly in the early stages while effectively managing the complexities of a growing codebase.
Investments in tools like the Boundary library and incremental builds ensured their monolith remained efficient and maintainable even as the team and codebase scaled dramatically.
Remote’s relatively simple infrastructure demonstrates that scaling doesn’t always require complexity. Their ability to easily handle unexpected workload spikes reflects the robustness of their architecture and operational practices.
Finally, their focus on team accountability and streamlined onboarding allowed them to maintain high productivity while integrating engineers from diverse technical backgrounds, regardless of their prior experience with Elixir.
Elixir v1.18 is an impressive release with improvements across the two main efforts happening within the Elixir ecosystem right now: set-theoretic types and language servers. It also comes with built-in JSON support and adds new capabilities to its unit testing library. Let’s go over each of those in detail.
There are several updates in the typing department, so let’s break them down.
There is an ongoing research and development effort to bring static types to Elixir. Elixir’s type system is:
sound - the types inferred and assigned by the type system align with the behaviour of the program
gradual - Elixir’s type system includes the dynamic() type, which can be used when the type of a variable or expression is checked at runtime. In the absence of dynamic(), Elixir’s type system behaves as a static one
developer friendly - the types are described, implemented, and composed using basic set operations: unions, intersections, and negation (hence it is a set-theoretic type system)
More interestingly, you can compose dynamic() with any type. For example, dynamic(integer() or float()) means the type is either integer() or float() at runtime. This allows the type system to emit warnings if none of the types are satisfied, even in the presence of dynamism.
Elixir v1.17 was the first release to incorporate the type system in the compiler. In particular, we have added support for primitive types (integer, float, binary, pids, references, ports), atoms, and maps. We also added type checking to a handful of operations related to those types, such as accessing fields in maps, as in user.adress (mind the typo), performing structural comparisons between structs, as in my_date < ~D[2010-04-17], etc.
The most exciting change in Elixir v1.18 is type checking of function calls, alongside gradual inference of patterns and return types. To understand how this will impact your programs, consider the following code defined in lib/user.ex:
defmodule User do
defstruct [:age, :car_choice]
def drive(%User{age: age, car_choice: car}, car_choices) when age >= 18 do
if car in car_choices do
{:ok, car}
else
{:error, :no_choice}
end
end
def drive(%User{}, _car_choices) do
{:error, :not_allowed}
end
end
Elixir’s type system will infer that the drive function expects a User struct as input and returns either {:ok, dynamic()}, {:error, :no_choice}, or {:error, :not_allowed}. Therefore, the following code
User.drive({:ok, %User{}}, car_choices)
will emit a warning stating that we are passing an invalid argument.
Now consider the expression below. We are expecting the User.drive/2 call to return :error, which cannot possibly be true:
case User.drive(user, car_choices) do
{:ok, car} -> car
:error -> Logger.error("User cannot drive")
end
Therefore, the code above would also emit a warning.
Our goal is for the warnings to provide enough contextual information to lead to clear reports, and that’s an area where we are actively looking for feedback. If you receive a warning that is unclear, please open up a bug report.
Elixir v1.18 also augments the type system with support for tuples and lists, plus type checking of almost all Elixir language constructs, except for-comprehensions, with, and closures. Here is a non-exhaustive list of the new violations that can be detected by the type system:
if you define a pattern that will never match any argument, such as def function(x = y, x = :foo, y = :bar)
matching or accessing tuples at an invalid index, such as elem(two_element_tuple, 2)
if you have a branch in a try that will never match the given expression
if you have a branch in a cond that always passes (except the last one) or always fails
if you attempt to use the return value of a call to raise/2 (which by definition returns no value)
In summary, this release takes us further in our journey of providing type checking and type inference of existing Elixir programs, without requiring Elixir developers to explicitly add type annotations.
For existing codebases with reasonable code coverage, most type system reports will come from uncovering dead code - code which won’t ever be executed - as seen in a few distinct projects. A notable example is the type system’s ability to track how private functions are used throughout a module and then point out which clauses are unused:
defmodule Example do
  def public(x) do
    # Integer.parse/1 returns either {integer, rest} or :error, so only
    # the clauses matching those shapes below can ever be invoked
    private(Integer.parse(x))
  end

  defp private(nil), do: nil
  defp private("foo"), do: "foo"
  defp private({int, _rest}), do: int
  defp private(:error), do: 0
  defp private("bar"), do: "bar"
end
Keep in mind the current implementation does not perform type inference of guards yet, which is an important source of typing information in programs. There is a lot the type system could still learn from our codebases that it does not yet. This brings us to the next topic.
The next Elixir release should improve the typing of maps, tuples, and closures, allowing us to type even more constructs. We also plan to fully type the with construct and for-comprehensions, as well as protocols.
But more importantly, we want to focus on complete type inference of guards, which in turn will allow us to explore ideas such as redundant pattern matching clauses and exhaustiveness checks. Our goal with inference is to strike the right balance between developer experience, compilation times, and the ability to find provable errors in existing codebases. You can learn more about the trade-offs we made for inference in our documentation.
Future Elixir versions will introduce user-supplied type signatures, which should bring the benefits of a static type system without relying on inference. Check our previous article on the overall milestones for more information.
The type system was made possible thanks to a partnership between CNRS and Remote. The development work is currently sponsored by Fresha (they are hiring!), Starfish*, and Dashbit.
Three months ago, we welcomed the Official Language Server team, with the goal of unifying the efforts behind code intelligence, tools, and editors in Elixir. Elixir v1.18 brings new features on this front by introducing locks and listeners to its compilation. Let’s understand what it means.
At the moment, all language server implementations have their own compilation environment. This means that your project and dependencies during development are compiled once, for your own use, and then again for the language server. This duplicate effort could cause the language server experience to lag, when it could be relying on the already compiled artifacts of your project.
This release addresses the issue by introducing a compiler lock, ensuring that only a single operating system process running Elixir compiles your project at a given moment, and by providing the ability for one operating system process to listen to the compilation results of others. In other words, different Elixir instances can now communicate over the same compilation build, instead of racing each other.
These enhancements do not only improve editor tooling; they also directly benefit projects like IEx and Phoenix. Here is a quick snippet showing how to enable auto-reloading inside IEx; running mix compile in one shell then automatically reloads the module inside the IEx session:
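(The original post demonstrates this as a recording; the option below is our reading of the feature, and the exact option name is an assumption to verify against the IEx docs.)
# In an IEx session: pick up modules recompiled by other OS processes,
# e.g. `mix compile` running in another shell (option name assumed).
iex> IEx.configure(auto_reload: true)
:ok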
Erlang/OTP 27 added built-in support for JSON and we are now bringing it to Elixir. A new module, called JSON, has been added with functions to encode and decode JSON. Its most basic APIs reflect the ones from the Jason project (the de-facto JSON library in the Elixir community up to this point).
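In its simplest form, encoding and decoding look as follows (a quick sketch; note that key order in encoded maps is not guaranteed):
iex> JSON.encode!(%{name: "Elixir"})
"{\"name\":\"Elixir\"}"
iex> JSON.decode!("{\"name\":\"Elixir\"}")
%{"name" => "Elixir"}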
A new protocol, called JSON.Encoder, is also provided for those who want to customize how their own data types are encoded to JSON. You can also derive protocols for structs with a single line of code:
@derive {JSON.Encoder, only: [:id, :name]}
defstruct [:id, :name, :email]
The deriving API mirrors the one from Jason, helping those who want to migrate to the new JSON module.
ExUnit now supports parameterized tests. This allows your test modules to run multiple times under different parameters.
For example, Elixir ships with a local, decentralized and scalable key-value process storage called Registry. The registry can be partitioned, and its implementation differs depending on whether partitioning is enabled or not. Therefore, during tests, we want to ensure both modes are exercised. With Elixir v1.18, we can achieve this by writing:
defmodule Registry.Test do
use ExUnit.Case,
async: true,
parameterize: [
%{partitions: 1},
%{partitions: 8}
]
# ... the actual tests ...
end
Once specified, the number of partitions is available as part of the test configuration. For example, to start one registry per test with the correct number of partitions, you can write:
setup config do
partitions = config.partitions
name = :"#{config.test}_#{partitions}"
opts = [keys: :unique, name: name, partitions: partitions]
start_supervised!({Registry, opts})
opts
end
Prior to parameterized tests, Elixir resorted to code generation, which increased compilation times. Furthermore, ExUnit parameterizes whole test modules, which also allows the different parameters to run concurrently if the async: true option is given. Overall, this feature allows you to compile and run multiple scenarios more efficiently.
Finally, ExUnit also comes with the ability to specify test groups. While ExUnit supports running tests concurrently, those tests must not have shared state between them. However, in large applications, it may be common for some tests to depend on some shared state, and other tests to depend on a completely separate state. For example, part of your tests may depend on Cassandra, while others depend on Redis. Prior to Elixir v1.18, these tests could not run concurrently, but in v1.18 they can, as long as they are assigned to different groups:
defmodule MyApp.PGTest do
use ExUnit.Case, async: true, group: :pg
# ...
end
Test modules within the same group do not run concurrently, but across groups, they might.
With features like async tests, suite partitioning, and now grouping, Elixir developers have plenty of flexibility to make the most use of their machine resources, both in development and in CI.
mix format --migrate
The mix format command now supports an explicit --migrate flag, which will convert constructs that have been deprecated in Elixir to their latest version. Because this flag rewrites the AST, it is not guaranteed the migrated format will always be valid when used in combination with macros that also perform AST rewriting.
As of this release, the following migrations are executed:
Normalize parens in bitstring modifiers - it removes unnecessary parentheses in known bitstring modifiers, for example <<foo::binary()>> becomes <<foo::binary>>, or adds parentheses for custom modifiers, where <<foo::custom_type>> becomes <<foo::custom_type()>>.
Charlists as sigils - formats charlists as ~c sigils, for example 'foo' becomes ~c"foo".
unless as negated ifs - rewrites unless expressions using if with a negated condition, for example unless foo do becomes if !foo do. We plan to deprecate unless in future releases.
More migrations will be added in future releases to help us push towards more consistent codebases.
Other notable changes include PartitionSupervisor.resize!/2, for resizing the number of partitions (aka processes) of a supervisor at runtime, Registry.lock/3 for simple in-process key locks, PowerShell versions of the elixir and elixirc scripts for better DX on Windows, and more. See the CHANGELOG for the complete release notes.
Happy coding!
The Erlang Port Mapper Daemon (EPMD) is a built-in component that helps Erlang-based applications (including RabbitMQ) discover each other’s distribution ports for clustering. Although EPMD itself isn’t directly dangerous, its exposure on the public internet often signals that Erlang Distribution ports are also exposed. This creates a serious security risk: if attackers find these distribution ports, they can potentially join your cluster, run arbitrary code, and compromise your systems. Recent scans have revealed over 85,000 instances of publicly accessible EPMD, with roughly half associated with RabbitMQ servers.
If left unsecured, exposed Erlang Distribution ports let attackers gain a foothold in your system. Fortunately, mitigation steps are straightforward: disable Erlang Distribution if you’re not clustering, or restrict it behind a firewall and proper network configuration—and ensure Erlang Distribution is never exposed to untrusted networks.
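If you do need clustering, one common hardening step, alongside firewalling epmd's TCP port 4369, is to pin the distribution port range so that a firewall can restrict it. For example (the port numbers here are arbitrary):
erl -kernel inet_dist_listen_min 9100 -kernel inet_dist_listen_max 9105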
Read the full article on the EEF blog.
Erlang/OTP 27.2 is the second maintenance patch package for OTP 27, with mostly bug fixes as well as improvements.
Among the highlighted changes: handling of the full_result request option when returning an asynchronous request.
For details about bug fixes and potential incompatibilities see the Erlang/OTP 27.2 README.
The Erlang/OTP source can also be found at GitHub on the official Erlang repository, https://github.com/erlang/otp. Download links for this and previous versions are found here.
I like to think that I write code deliberately. I’m an admittedly slow developer, and I want to believe I do so on purpose. I want to know as much as I can about the context of what it is that I'm automating. I also use a limited set of tools. I used old computers for a long time, partly out of an environmental mindset, but also because a slower computer quickly makes it obvious when something scales poorly.1
The idea is to seek friction, and harness it as an early signal that whatever I’m doing may need to be tweaked, readjusted. I find this friction, and even frustration in general, to also be useful around learning approaches.2
In opposition to the way I'd like to do things, everything about the tech industry is oriented towards elevated productivity, accelerated growth, and "easy" solutions to whole families of problems.
I feel that maybe we should teach people to program the way they teach martial arts, like only in the most desperate situations when all else failed should you resort to automating something. I don’t quite know if I’m just old and grumpy, seeing industry trends fly by me at a pace I don’t follow, or whether there’s really something to it, but I thought I’d take a walk through a set of ideas and concepts that motivate my stance.
This blog post has a lot of ground to cover. I'll first start with some fundamental properties of systems and how overload propagates through various bottlenecks. Then I'll go over some high-level pressures that are shared by most organizations and force trade-offs down their structure. These two aspects—load propagation and pervasive trade-offs—create the need for compensatory actions, of which we'll discuss some limits. This, finally, will be tied back to friction and ways to listen to it, because it's one of the things that underpins adaptation and keeps systems running.
Optimizing a frictional path without revising the system’s conditions and pressures tends to not actually improve the system. Instead, what you’re likely to do is surface brittleness in all the areas that are now exposed to the new system demands. Whether a bottleneck was invisible or well monitored, and regardless of scale, it offered an implicit form of protection that was likely taken for granted.
For a small scale example, imagine you run a small bit of software on a server, talking to a database. If you suddenly get a lot of visits, simply autoscaling the web front-end will likely leave the database unprotected and sensitive to tipping over (well, usually after having grown the connection pool, raised the connection limit, vertically scaled the servers, and so on). None of this will let you serve heavy traffic at a reasonable price until you rework your caching and data distribution strategy. Building for orders of magnitude more traffic than usual requires changing some fundamental aspects of your solution.
Similar patterns can be seen at a larger scale. An interesting case was the Clarkesworld magazine; as LLMs made it possible to produce slop at a faster rate than previously normal, an inherent bottleneck in authorship ("writing a book takes significant time and effort") was removed, leading to so much garbage that the magazine had to stop taking in submissions. They eventually ended up bearing the cost of creating a sort of imperfect queuing "spam filter" for submissions in order to accept them again. They don't necessarily publish more stories than before, they still aim to publish the good human-written stuff, there's just more costly garbage flowing through the system.3
A similar case to look for is how doctors in the US started using generative AI to fight insurance claim denials. Of course, insurers are now expected to adopt the same technology to counteract this effect. A general issue at play here is that the private insurance system's objectives and priorities are in conflict with those of the doctors and patients. Without realigning them, most of what we can expect is an increase in costs and technological means to get the same results out of it. People who don’t or can’t use the new tools are going to be left behind.
The optimization's benefit is temporary, limited, and ultimately lost in the overall system, which has grown more complex and possibly less accessible.4
I think LLMs are top of mind for people because they feel like a shift in how you automate. The common perspective is that machines are good at repetitive, predictable, mechanical tasks, and that solutions always suffered when it came to the fuzzy, unpredictable, and changing human-adjacent elements. LLMs look exactly the opposite of that: the computers can't do math very well anymore, but they seem to hold conversations and read intent much better. They therefore look like a huge opportunity to automate more of the human element and optimize it away, following well-established pressures and patterns. Alternatively, they seemingly increase the potential for new tools that could be created and support people in areas where none existed before.
The issues I'm discussing here clearly apply to AI, Machine Learning, and particularly LLMs. But they are also not specific to them. People who love the solution more than they appreciate the problem risk delivering clumsy integrations that aren’t really fit for purpose. This is why it feels like companies are wedging more AI in our face; that's what the investors wanted in order to signal innovativeness, or because the engineers really wanted to build cool shit, rather than solving the problems the users wanted or needed solved. The challenges around automation were there from its earliest days and are still in play now. They remain similar regardless of the type of automation or optimization being put in place, particularly if the system around them does not reorganize itself.
The canonical example here is what happens when an organization looms so large that people can't understand what is going on. The standard playbook around this is to start driving purely by metrics, which end up compressing away rich phenomena. Doing so faster, whether it is by gathering more data (even if we already had too much) or by summarizing harder via a LLM likely won't help run things better. Summaries, like metrics, are lossy compression. They're also not that different from management by PowerPoint slides, which we've seen cause problems in the space program, as highlighted by the Columbia report:
As information gets passed up an organization hierarchy, from people who do analysis to mid-level managers to high-level leadership, key explanations and supporting information is filtered out. In this context, it is easy to understand how a senior manager might read this PowerPoint slide and not realize that it addresses a life-threatening situation.
At many points during its investigation, the Board was surprised to receive similar presentation slides from NASA officials in place of technical reports. The Board views the endemic use of PowerPoint briefing slides instead of technical papers as an illustration of the problematic methods of technical communication at NASA.
There is no reason to think that overly aggressive summarization via PowerPoint, LLM, or metrics would not all end similarly. If your decision-making layer cannot deal with the amount of information required to centrally make informed decisions, there may be a point where the solution is to change the system's structure (and decentralize, which has its own pitfalls) rather than to optimize the existing paths without question.5
Every actor, component, or communication channel in a system has inherent limits. Any part that suddenly becomes faster or more productive without feedback shifts greater burdens onto other parts. These other parts must adapt, adjust, pass on the cost, or stop meeting expectations. Eliminating friction from one part of the system sometimes just shifts it around. System problems tend to remain system problems regardless of how much you optimize isolated portions of them.
How can we know what is worth optimizing, and what is changing at a more structural level?6 It helps to have an idea of where the pressures that create goal conflicts might come from, since they eventually lead to adaptations. Systems tend to continually be stretched to the limit of their capacity, and any improvement is instantly leveraged to accelerate the pace of existing activities.
This is usually where online people say things like "the root cause is capitalism"7—you shouldn't expect local solutions to fix systemic problems in the long term. The moment other players dynamically reduce their margins of maneuver to gain efficiency, you become relatively less competitive. You can think of how we could all formally prove software to be safe before shipping it, but instead we’ll compromise by using less formal methods like type analysis, tests, or feature flags to deliver acceptable products at much lower costs—both financial and cognitive. Be late to the market and you suffer, so there's a constant drive to ship faster and course-correct often.
People more hopeful or trusting of a system try to create and apply counteracting forces to maintain safe operating margins. This tends to be done through changing incentives, creating regulatory bodies, and implementing better control and reporting mechanisms. This is often the approach you'll see taken around the nuclear industry, the FAA and the aviation industry, and so on. However, there are also known patterns (such as regulatory capture) that tend to erode these mechanisms, and even within each of these industries, surprises and adaptations are still a regular occurrence.
Ultimately, the effects of any technological change are rather unpredictable. Designing for systems where experts operate demands constantly revisiting and iterating. The concepts we define to govern systems create their own indifference to other important perspectives, and data-driven approaches carry the risk of "bias laundering" mechanisms that repeat and amplify existing flaws in the system.
Other less predictable effects can happen. Adopting objectively more accurate algorithms can create monocultures in decision-making, which can interact such that the overall system efficiency can go down compared to more diverse environments—even in the absence of disruption.
Basically, the need for increased automation isn't likely to "normalize" a system and make it more predictable. It tends to just create new types of surprises in a way that does not remove the need for adaptation nor shift pressures; it only transforms them and makes them dynamic.
Embedded deeply in our view of systems is an assumption that things are stable until they are disrupted. It’s possibly where ideas like “root cause” gain their charisma: identify the one triggering disruptor (or its underlying mechanism) and then the system will be stable again. It’s conceptually a bit Newtonian in that if no force is applied, nothing will change.
A more ecological stance would instead assume that any perceived stability (while maintaining function) requires ongoing dynamic adjustments. The system is always decaying, transforming, interacting, changing. Stop interfering with it and it will eventually reach stability (without maintaining function) by breaking down or failing. If the pressures are constant and shifting as well as the counteracting mechanisms, we can assume that evolution and adaptation are required to deal with this dynamism. Over time, we should expect that the system instead evolves into a shape that fits its burdens while driven by scarcity and efficiency.
A risk in play here is that an ecosystem's pressures make it rational and necessary for all actors to optimize when they’re each other’s focal point—rather than some environmental condition. The more aggressively it is done, the more aggressively it is needed by others to stay in the game.
Robust yet fragile is the nature of systems that are well optimized for their main use cases and competitive within their environment, but which become easily upended by pressures applied from unexpected angles (that are therefore unprotected, since resources were used elsewhere instead).
Good examples of this are Just-In-Time supply chains being far more efficient than traditional ones, but being far easier to disrupt in times of disasters or pandemics. Most buffers in the supply chain (such as stock held in warehouses) had been replaced by more agile and effective production and delivery mechanisms. Particularly, the economic benefits (in stable times) and the need for competitiveness have made it tricky for many businesses not to rely on them.
The issue with optimizations driven from systemic pressures is that as you look at trimming the costs of keeping a subsystem going in times of stability, you may notice decent amounts of slack capacity that you could get rid of or drive harder in order to be more competitive in your ecosystem. Those are often the resources that resilience efforts draw on to keep adapting and evolving.
Another form of rationalization in systems is one where rather than cutting "excess", the adoption and expansion of (software) platforms are used to drive economies of scale. Standardization and uniformization of patterns, methods, and processes is a good way to get more bang for your buck on an investment, to do more with less. Any such platform is going to have some things it gives its users for cheap, and some things that become otherwise challenging to do.8 Friction felt here can both be caused by going against the platform's optimal use cases or by the platform not properly supporting some use cases—it's a signal worth listening to.
In fact, we can more or less assume that friction is coming from everywhere because it's connected to these pressures. They just happen to be pervasive, at every layer of abstraction. If we had infinite time, infinite resources, or infinite capacity, we'd never need to optimize a thing.
Successfully navigating these pressures is essentially drawing from concepts such as graceful extensibility and sustained adaptability. In a nutshell, we're looking to know how systems stretch themselves to deal with disruptions and surprises in a context of finite resources, and also how a system manages and regulates its own abilities to do that on an ongoing basis. Remember that every actor or component of a system has inherent limits. This is also true of our ability to know what is going on, something known as local rationality.
This means that even if we're really hoping we could intervene from the system level first and avoid the (sometimes deceptively ineffective) local optimizations, it will regardless be attempted through local efforts. Knowing and detecting the friction behind it is useful for whoever wants the broader systematic view to act earlier, but large portions of the system are going to remain dynamic and co-evolving from locally felt pains and friction. Local rationality impacts everyone, even the most confident of system thinkers.
Friction shifts are unavoidable, so it's useful to also know of the ways in which they show up. Unfortunately, these shifts generally remain unseen from afar, because compensatory mechanisms and adaptation patterns hide them.9 So instead, it's more practical to find how to spot the compensatory patterns themselves.
One of the well-known mechanisms is the Efficiency–thoroughness trade-off (ETTO) principle, which states that since time and resources are limited, one has to trade-off efficiency and thoroughness to accomplish a task. Basically, if there's more work to do than there's capacity to do it, either you maintain thoroughness and the work accumulates or gets dropped, or you do work less thoroughly, possibly cut corners, accuracy, or you have to be less careful and keep going as fast as required.
This is also one of the patterns feeding concepts such as "deviance" (often used in normalization of deviance, although the term alone points to any variation relative to norms), where procedures and rules defining safe work start being modified or bent unofficially, until covert work patterns grow a gap between the work as it is specified and how it is practiced.10
Of course, another path is one of innovation, which can mean some reorganization or restructuring. We happen to be in tech, so we tend to prefer to increase capacity by using new technology. New technology is rarely neutral and never isolated. It disturbs established patterns—often on purpose, but sometimes in unexpected ways—can require a complex support system, and for everyone to adjust around it to maintain the proper operational context. Adding to this, if automation is clumsy enough, it won’t be used to its full potential to avoid distracting or burdening practitioners using it to do their work. The ongoing adaptations and trade-offs create potential risks and needs for reciprocity to anticipate and respond to new contingencies.
You basically need people who know the system, how it works, understand what is normal or abnormal, and how to work around its flaws. They are usually those who have the capacity to detect any sort of "creaking" in local parts of the system, who harness the friction and can then do some adjusting, mustering and creating slack to provide the margin to absorb surprises. They are compensating for weaknesses as they appear by providing adaptive capacity.
Some organizations may enjoy these benefits without fixing anything else by burning out employees and churning through workers, using them as a kind of human buffer for systemic stressors. This can sustain them for a while, but may eventually reach its limits.
Even without any sort of willful abuse, pressures lead a system to try to fully use or optimize away the spare capacity within. This can eventually exhaust the compensatory mechanisms it needs to function, leading to something called "decompensation".
Compensatory mechanisms are often called on so gradually that your average observer wouldn't even know it's taking place. Systems (or organisms) that appear absolutely healthy one day collapse, and we discover they were overextended for a long while. Let's look at congestive heart failure as an example.11
Effects of heart damage accumulate gradually over the years—partly just by aging—and can be offset by compensatory mechanisms in the human body. As the heart becomes weaker and pumps less blood with each beat, adjustments manage to keep the overall flow constant over time. This can be done by increasing the heart rate using complex neural and hormonal signaling.
Other processes can be added to this: kidneys faced with lower blood pressure and flow can reduce how much urine they create to keep more fluid in the circulatory system, which increases cardiac filling pressure, which stretches the heart further before each beat, which adds to the stroke volume. Multiple pathways of this kind exist through the body, and they can maintain or optimize cardiac performance.
However, each of these compensatory mechanisms has less desirable consequences. The heart remains damaged and they offset it, but the organism remains unable to generate greater cardiac output such as would be required during exercise. You would therefore see "normal" cardiac performance at rest, with little ability to deal with increased demand. If the damage is gradual enough, the organism will adjust its behavior to maintain compensation: you will walk slower, take breaks while climbing stairs, and will just generally avoid situations that strain your body. This may be done without even awareness of the decreased capacity of the system, and we may even resist acknowledging that we ever slowed down.
Decompensation happens when all the compensatory mechanisms no longer prevent a downward spiral. If the heart can't maintain its output anymore, other organs (most often the kidneys) start failing. A failing organ can't overextend itself to help the heart; what was a stable negative feedback loop becomes a positive feedback loop, which quickly leads to collapse and death.
Someone with a compensated congestive heart failure appears well and stable. They have gradually adjusted their habits to cope with their limited capacity as their heart weakened through life. However, looking well and healthy can hide how precarious of a position the organism is in. Someone in their late sixties skipping their heart medication for a few days or adopting a saltier diet could be enough to tip the scales into decompensation.
Decompensation usually doesn’t happen because compensation mechanisms fail, but because their range is exhausted. A system that is compensating looks fine until it doesn’t. That's when failures may cascade and major breakdowns occur. This applies to all sorts of systems, biological as well as sociotechnical.
A common example seen in the tech industry is one where overburdened teams continuously pull small miracles and fight fires, keeping things working through major efforts. The teams are stretched thin, nobody's been on vacation for a while, and hiring is difficult because nobody wants to jump into that sort of place. All you need is one extra incident, one person falling ill or quitting, needing to add one extra feature (which nobody has bandwidth to work on), and the whole thing falls apart.
But even within purely technical subsystems, automation reaching its limits often shows up a bit like decompensation when it hands control back to a human operator who doesn't have the capacity to deal with what is going on (one of the many things pointed out by the classic text on the Ironies of Automation). Think of an autopilot that disengages once it has reached the limit of what it can do to stabilize a plane in hazardous conditions. Or of a cluster autoscaler that can no longer schedule more containers or hosts and starts crowding them until performance collapses, queues fill up, and the whole application becomes unresponsive.
Eventually, things spin out into a much bigger emergency than you'd have expected as everything appeared fine. There might have been subtle clues—too subtle to be picked up without knowing where to look—which shouldn't distract from their importance. Friction usually involves some of these indicators.
Going back to friction being useful feedback, the question I want to ask is: how can we keep listening? The most effective actions are systemic, but the friction patterns are often local. If we detect the friction, papering over it via optimization or brute-force necessarily keeps it local, and potentially ineffective. We need to do the more complex work of turning friction into a system-level feedback signal for it to have better chances of success and sustainability. We can't cover all the clues, but surfacing key ones can be critical for the system to anticipate surprises and foster broader adaptive responses.
When we see inappropriate outcomes of a system, we should be led to wonder what about its structure makes it a normal output. What are the externalities others suffer as a consequence of the system's strengths and weaknesses? This is a big question that feels out of reach for most, and not necessarily practical for everyday life. But it’s an important one as we repeatedly make daily decisions around trading off “working a bit faster” against the impacts of the tools we adopt, whether they are environmental, philosophical, or sociopolitical.
Closer to our daily work as developers, when we see code that’s a bit messy and hard to understand, we either slow down to create and repair that understanding, or patch it up with local information and move on. When we do this with a tool that manages the information for us, are we in a situation where we accelerate ourselves by providing better framing and structure, or one where we just get where we want without acknowledging the friction?12
If it's the latter, what are the effects of ignoring the friction? Are we creating technical debt that can’t be managed without the tools? Are we risking increasingly not reorganizing the system when it creaks, and only waiting to see obvious breaks to know it needs attention? In fact, how would you even become good at knowing what creaking sounds like if you just always slam through the hurdles?
Recognizing these patterns is a skill, and it tends to require knowing what “normal” feels like such that you can detect what is not there when you start deviating.13
If you use a bot for code reviews, ask yourself whether it is replacing people reviewing and eroding the process. Is it providing a backstop? Are there things it can't know about that you think are important? Is it palliating already missing support? Are the additional code changes dictated by review comments worth more than the acts of reviewing and discussing the code? Do you get a different result if the bot only reviews code that someone else already reviewed to add more coverage, rather than implicitly making it easier to ignore reviews and go fast?
Work that takes time is a form of friction, and it's therefore tempting to seek ways to make it go faster. Before optimizing it away, ask yourself whether it might have outputs other than its main outputs. Maybe you’re fixing a broken process for an overextended team. Maybe you’re eroding annoying but surprisingly important opportunities for teams to learn, synchronize, share, or reflect on their practices without making room for a replacement.
When you're reworking a portion of a system to make it more automatable, ask whether any of the facilitating and structuring steps you're putting in place could also benefit people directly. I recall hearing a customer who said “We are now documenting things in human-readable text so AI can make use of it”—an investment that clearly could have been worth it for people too. Use the change of perspective as an opportunity to surface elements hidden in the broader context and ecosystem, and on which people rely implicitly.
I've been disappointed by proposals of turning LLMs into incident reviewers; I'd rather see them become analysis second-guessers: maybe they can point out agentive language leading to bias, flag elements that sound counterfactual, or highlight elements that appear blameful to create blame awareness?
If you make the decision to automate, still ask the questions and seek the friction. Systems adjust themselves and activate their adaptive capacity based on the type of challenges they face. Highlight friction. It’s useful, and it would be a waste to ignore it.
Thanks to Jordan Goodnough, Alan Kraft, and Laura Nolan for reviewing this text.
1: I’m forced to refresh my work equipment more often now because new software appears to hunger for newer hardware at an accelerating pace.
2: As a side note, I'd like to call out the difference between friction, where you feel resistance and that your progression is not as expected based on experience, and one of pain, where you're just making no progress at all and having a plain old bad time. I'd put "pain" in a category where you might feel more helpless, or do useless work just because that's how people first gained the experience without any good reason for it to still be learned the same today. Under this casual definition, friction is the unfamiliar feeling when getting used to your tools and seeking better ways of wielding them, and pain is injuring yourself because the tools have poor ergonomic properties.
3: the same problem can be felt in online book retail, where spammers started hijacking the names of established authors with fake books. The cost of managing this is left to authors—and even myself, having published mostly about Erlang stuff, have had at least two fake books published under my name in the last couple years.
4: In Energy and Equity, Ivan Illich proposes that societies built on high-speed motorized transportation create a "radical monopoly," basically stating that as the society grows around cars and scales its distances proportionally to time spent traveling, living without affording a car and its upkeep becomes harder and harder. This raises the bar of participation in such environments, and it's easy to imagine a parallel within other sociotechnical systems.
5: AI is charismatic technology. It is tempting to think of it as the one optimization that can make decisions such that the overall system remains unchanged while its outputs improve. Its role as fantasized by science fiction is one of an industrial supply chain built to produce constantly good decisions. This does not reduce its potential for surprise or risk. Machine-as-human-replacement is most often misguided. I don't believe we're anywhere near that point, and I don't think it's quite necessary to make an argument about it.
6: Because structural changes often require a lot more time and effort than local optimizations, you sometimes need to carry both types of interventions at the same time: a piecemeal local optimization to "extend the runway", and broader interventions to change the conditions of the system. A common problem for sustainability is to assume that extending the runway forever is both possible and sufficient, and never follow up with broader acts.
7: While capitalism has a keen ability to drive constraints of this kind, scarcity constraints are fairly universal. For example, Sonja D. Schmid, in Producing Power illustrates that some of the contributing factors that encouraged the widespread use of the RBMK reactor design in the USSR—the same design used in Chernobyl—were that its manufacturing was more easily distributed over broad geographic areas and sourced from local materials which could avoid the planned system's inefficiencies, and therefore meet electrification objectives in ways that couldn't be done with competing (and safer) reactor designs. Additionally, competing designs often needed centralized manufacturing of parts that could then not be shipped through communist USSR without having to increase the dimensions of some existing train tunnels, forcing upgrades to its rail network to open power plants.
An entirely unrelated example is that a beehive's honeycomb structure optimizes for using the least material to create a lattice of cells within a given volume.
8: AWS or Kubernetes or your favorite framework all come with some real cool capabilities and also some real trade-offs. What they're built to do makes some things much easier, and some things much harder. Do note that when you’re building something for the first time on a schedule, prioritizing to deliver a minimal first set of features also acts as an inherent optimization phase: what you choose to build and leave for later fits that same trade-off pattern.
9: This is similar to something called the Law of Fluency, which states that well-adapted cognitive work occurs with a facility that belies the difficulty of resolving demands and balancing dilemmas. While the law of fluency works at the individual cognitive level, I tend to assume it also shows up at larger organizational or system levels as well.
10: Rule- and Role-retreat may also be seen when people get overloaded, but won't deviate or adjust their plans to new circumstances. This "failure to adapt" can also contribute to incidents, and is one of the reasons why some forms of deviations have to be considered positive for the system.
11: Most of the information in this section came from Dr. Richard I. Cook, explaining the concept in a group discussion, a few years before his passing.
12: This isn't purely a tooling decision; you also make this type of call every time you choose to refactor code to create an abstraction instead of copy/pasting bits of it around.
13: I believe, but can't prove, that there's also a tenuous but real path between the small-scale frictions, annoyances, and injustices we let slip and the way they can propagate and grow to greater systemic scales. Tremendously important work is always being done at the local level, where people bridge the gap between what the system orders and what the world needs. If there are paths leading that feedback up from the local level, they are critical to keeping things aligned. I'm unsure what the links between them are, but I like to think that small adjustments made by people with agency form part of a negative feedback loop that partially keeps things in check.
This post originally appeared on the LFI blog, but I decided to post it on my own site as well.
Every organization has to contend with limits: scarcity of resources, people, attention, or funding; friction from scaling; inertia from previous codebases; or a quickly shifting ecosystem. And of course there are more, like time, quality, effort, or how much can fit in anyone's mind. There are so many ways for things to go wrong; your ongoing success comes in no small part from the people within your system constantly navigating that space, making sacrifice decisions and trading off some things to buy runway elsewhere. From time to time, these trade-offs come to a head in what we call a goal conflict, where two important attributes clash with each other.
These conflicts are not avoidable; in fact, in many cases we just take them as a given, as in "cheap, fast, and good; pick two." But somehow, when it comes to the more specific details of our work, that clarity hides itself or gets obscured by the veil of normative judgments. It is easy after an incident to think of what people could have done differently, of signals they should have listened to, or of consequences they would have foreseen had they just been a little bit more careful.
From this point of view, the idea of reinforcing desired behaviors through incentives, both positive (bonuses, public praise, promotions) and negative (demerits, recertification, disciplinary reviews), can feel attractive. (Do note here that I am specifically talking about incentives around specific decision-making or performance, rather than broader ones such as wages, perks, overtime or hazard pay, or employment benefits, even though the effects may sometimes overlap.)
But this perspective itself is a trap. Hindsight bias—where we overestimate how predictable outcomes were after the fact—and its close relative outcome bias—where knowing the results after the fact tints how we judge the decision made—both serve as good reminders that we should ideally look at decisions as they were being made, with the information known and pressures present then.
This is generally made easier by assuming people were trying to do a good job and get good results; a decision that seems to make no sense asks of us that we figure out how it seemed reasonable at the time.
Events were likely challenging, resources were limited (including cognitive bandwidth), and context was probably uncertain. If you were looking for goal conflicts and difficult trade-offs, this is certainly a promising area in which they can be found.
Taking people's desire for good outcomes for granted forces you to shift your perspective. It demands that you move away from thinking that somehow more pressure toward succeeding would help. It makes you ask what aid could be given to navigate the situation better, and how the context could be changed so that the trade-offs get negotiated differently next time around. It lets you move away from wondering how to prevent mistakes and toward asking how to better support the participants.
Hell, the idea of rewarding desired behavior feels enticing even in cases where your review process avoids the traps mentioned here and takes a more just approach.
But the core idea here is that you can't really expect different outcomes if the pressures and goals that gave rise to them don't change either.
During incidents, the priorities already in play are things like "I've got to fix this to keep the business alive," stabilizing the system to prevent large cascades, or trying to prevent harm to users or customers. They come with stress, adrenaline, and sometimes a sense of panic or shock. These are likely to rank higher in people's minds than "what's my bonus gonna be?" or "am I losing a gift card or some plaque if I fail?"
Adding incentives, whether positive or negative, does not clarify the situation. It does not address the goal conflicts. It adds more variables to the equation and complicates the situation, likely making it more challenging.
Chances are that people will keep making the same decisions they have been making all along, the ones that were obtaining the desired outcomes. What they'll change instead is what they report later, in subtle ways: tweaking or hiding information to protect themselves, or gradually losing trust in the process you've put in place. These effects can be amplified when teams are given hard-to-meet abstract targets, such as lowering incident counts, which can actively interfere with incident response by creating new decision points in people's mental flows. If responders have to discuss and classify the nature of an incident to fit an accounting system unrelated to solving it right now, their response is likely to be slower and more challenging.
This is not to say that all attempts at structure and classification hinder proper response, though. Clarifying which critical elements to salvage first, creating cues and language for patterns that will be encountered, and agreeing on strategies that support effective coordination across participants can all be really productive. But this needs to be done with a deeper understanding of how your incident response actually works, and that sometimes means hearing unpleasant feedback about how people perceive your priorities.
I've been in reviews where people stated things like "we know that we get yelled at more for delivering features late than for shipping broken code, so we just shipped broken code since we were out of time," or admitted to ignoring execs who made a habit of coming down from above to scold employees into fixing things they were being pressured into doing anyway. These can be hurtful for an organization to consider, but they are nevertheless a real part of how people deal with exceptional situations.
By trying to properly understand the challenges, by clarifying the goal conflicts that arise in systems and result in sometimes frustrating trade-offs, and by making learning from these experiences an objective of its own, we can hopefully make things a bit better. Grounding our interventions in a richer, more naturalistic understanding of incident response and all its challenges is a small, albeit critical, part of it all.