Web scraping with Elixir


Businesses are investing in data. The big data analytics market is expected to grow to $103 billion (USD) within the next five years. It’s easy to see why, with every one of us generating, on average, 1.7 megabytes of data per second.

As the amount of data we create grows, so too does our ability to interpret and understand it. Taking enormous datasets and generating very specific findings is leading to fantastic progress in all areas of human knowledge, including science, marketing and machine learning.

To do this correctly, we need to ask an important question: how do we prepare datasets that meet the needs of our specific research? When dealing with publicly available data on the web, the answer is web scraping.

What is web scraping?

Wikipedia defines web scraping as:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. (See: https://en.wikipedia.org/wiki/Web_scraping).

Web scraping with Elixir

Web scraping can be done differently in different languages; for the purposes of this article, we’re going to show you how to do it in Elixir, for obvious reasons ;). Specifically, we’re going to show you how to do web scraping with an Elixir framework called Crawly.

Crawly is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival. You can find out more about it on the documentation page.

Getting started

For the purposes of this demonstration, we will build a web scraper to extract the data from Erlang Solutions Blog. We will extract all the blog posts, so they can be analyzed later (let’s assume we’re planning to build some machine learning to identify the most useful articles and authors).

First of all, we will create a new Elixir project:

mix new esl_blog --sup

Now that the project is created, modify the deps function of the mix.exs file so it looks like this:

    # Run "mix help deps" to learn about dependencies.
    defp deps do
      [
        {:crawly, "~> 0.1"}
      ]
    end

Fetch the dependencies with: mix deps.get, and we’re ready to go! So let’s define our crawling rules with the help of… spiders.

The spider


Spiders are behaviours for which you provide an implementation, and which Crawly uses to extract information from a given website. The spider must implement the Crawly.Spider behaviour (it’s required to implement the parse_item/1, init/0 and base_url/0 callbacks). This is the code for our first spider. Save it into a file called esl.ex under the lib/esl_blog/spiders directory of your project.

defmodule Esl do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url() do
    "https://www.erlang-solutions.com"
  end

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://www.erlang-solutions.com/blog.html"]]
  end

  @impl Crawly.Spider
  def parse_item(_response) do
    %Crawly.ParsedItem{:items => [], :requests => []}
  end
end

Here’s a more detailed run down of the above code:

base_url(): a function which returns the base URL for the given spider; it is used to filter out all irrelevant requests. In our case, we don’t want our crawler to follow links going to social media sites and other partner sites (which are not related to the crawled target).

init(): must return a keyword list which contains the start_urls list from which the crawler will begin. Subsequent requests will be generated from these initial URLs.

parse_item(): a function which will be called to handle each response downloaded by Crawly. It must return the Crawly.ParsedItem structure.

Crawling the website

Now it’s time to run our first spider. Start the interactive Elixir console on the current project (iex -S mix), and run the following command:

Crawly.Engine.start_spider(Esl)
You will get the following (or similar) result:

22:37:15.064 [info] Starting the manager for Elixir.Esl

22:37:15.073 [debug] Starting requests storage worker for Elixir.Esl…

22:37:15.454 [debug] Started 4 workers for Elixir.Esl
:ok
iex(2)>

22:38:15.455 [info] Current crawl speed is: 0 items/min

22:38:15.455 [info] Stopping Esl, itemcount timeout achieved

So what just happened under the hood?

Crawly scheduled the requests returned by the init/0 function of the spider. Upon receiving a response, Crawly used the callback function (parse_item/1) to process it. In our case, we had not defined any data to be returned by the parse_item/1 callback, so the Crawly worker processes (responsible for downloading requests) had nothing to do, and the spider was closed once the timeout was reached. Now it’s time to harvest the real data!

Extracting the data

Now it’s time to implement the real data extraction part. Our parse_item/1 callback is expected to return Requests and Items. Let’s start with the Requests extraction:

Open the Elixir console and execute the following:

{:ok, response} = Crawly.fetch("https://www.erlang-solutions.com/blog.html")

You will see something like:

%HTTPoison.Response{
  body: "<!DOCTYPE html>\n\n<!--[if IE 9 ]>" <> ...,
  headers: [
    {"Date", "Sat, 08 Jun 2019 20:42:05 GMT"},
    {"Content-Type", "text/html"},
    {"Content-Length", "288809"},
    {"Last-Modified", "Fri, 07 Jun 2019 09:24:07 GMT"},
    {"Accept-Ranges", "bytes"},
    ...
  ],
  request: %HTTPoison.Request{
    body: "",
    headers: [],
    method: :get,
    options: [],
    params: %{},
    url: "https://www.erlang-solutions.com/blog.html"
  },
  request_url: "https://www.erlang-solutions.com/blog.html",
  status_code: 200
}

This is our response, as fetched from the web. Now let’s use Floki to extract the data from it.

First of all, we’re interested in the “Read more” links, as they allow us to navigate to the actual blog pages. Let’s use Floki to extract those URLs from the given page.


Using Firebug or any other web developer extension, find the relevant CSS selectors. In my case, I ended up with the following entries:

iex(6)> response.body |> Floki.find("a.more") |> Floki.attribute("href")
["/blog/a-guide-to-tracing-in-elixir.html",
 "/blog/blockchain-no-brainer-ownership-in-the-digital-era.html",
 "/blog/introducing-telemetry.html",
 "/blog/remembering-joe-a-quarter-of-a-century-of-inspiration-and-friendship.html",
 ...]

Next, we need to convert the links into requests. Crawly requests are a way to provide some flexibility to a spider; for example, sometimes you have to modify the HTTP parameters of a request before sending it to the target website.

Crawly is fully asynchronous. Once the requests are scheduled, they are picked up by separate workers and executed in parallel. This also means that other requests can keep going even if some request fails, or an error happens while handling it.

In our case, we don’t want to tweak HTTP headers, so we can use the shortcut function Crawly.Utils.request_from_url/1. Note that Crawly expects absolute URLs in requests; it’s the responsibility of a spider to prepare correct URLs.

Now that we know how to extract links to the actual blog posts, let’s also extract the data from the blog pages. Let’s fetch one of the pages, using the same command as before (but with the URL of the blog post):

{:ok, response} = Crawly.fetch("https://www.erlang-solutions.com/blog/a-guide-to-tracing-in-elixir.html")
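To make the conversion concrete, here is a small sketch (the link list and base URL are illustrative) showing how relative links can be made absolute with Elixir’s built-in URI.merge/2 before being wrapped into requests with Crawly.Utils.request_from_url/1:

```elixir
# Illustrative sketch: make relative links absolute before
# turning them into Crawly requests.
urls = ["/blog/a-guide-to-tracing-in-elixir.html"]
base = "https://www.erlang-solutions.com/blog.html"

absolute =
  Enum.map(urls, fn url ->
    URI.merge(base, url) |> to_string()
  end)

# Each absolute URL could then be wrapped with
# Crawly.Utils.request_from_url/1 (not called here, since
# it requires the Crawly dependency).
```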

Now, let’s use Floki to extract the title, text, author, and URL from the blog post.

Extract the title with:

Floki.find(response.body, "h1:first-child") |> Floki.text

Extract the author with:
  author =
    response.body
    |> Floki.find("article.blog_post p.subheading")
    |> Floki.text(deep: false, sep: "")
    |> String.trim_leading()
    |> String.trim_trailing()

Extract the text with:

Floki.find(response.body, "article.blog_post") |> Floki.text

Finally, the URL does not need to be extracted, as it’s already part of the response! Now it’s time to wire everything together. Let’s update our spider code with all the snippets from above:

defmodule Esl do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url() do
    "https://www.erlang-solutions.com"
  end

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://www.erlang-solutions.com/blog.html"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Getting new urls to follow
    urls =
      response.body
      |> Floki.find("a.more")
      |> Floki.attribute("href")

    # Convert URLs into requests
    requests =
      Enum.map(urls, fn url ->
        url
        |> build_absolute_url(response.request_url)
        |> Crawly.Utils.request_from_url()
      end)

    # Extract item from a page, e.g.
    # https://www.erlang-solutions.com/blog/introducing-telemetry.html
    title =
      response.body
      |> Floki.find("article.blog_post h1:first-child")
      |> Floki.text()

    author =
      response.body
      |> Floki.find("article.blog_post p.subheading")
      |> Floki.text(deep: false, sep: "")
      |> String.trim_leading()
      |> String.trim_trailing()

    text = Floki.find(response.body, "article.blog_post") |> Floki.text()

    %Crawly.ParsedItem{
      :requests => requests,
      :items => [
        %{title: title, author: author, text: text, url: response.request_url}
      ]
    }
  end

  def build_absolute_url(url, request_url) do
    URI.merge(request_url, url) |> to_string()
  end
end

Running the spider again

If you run the spider, it will output the extracted data with the log:

22:47:06.744 [debug] Scraped "%{author: \"by Lukas Larsson\", text: \"Erlang 19.0 Garbage Collector2016-04-07" <> …

Which indicates that the data has successfully been extracted from all blog pages.

What else?

You’ve seen how to extract and store items from a website using Crawly, but this is just the basic example. Crawly provides a lot of powerful features for making scraping easy and efficient, such as:

  • Flexible request spoofing (for example user-agent rotation)
  • Item validation, using a pipeline approach
  • Filtering already seen requests and items
  • Filtering out all requests which are coming from other domains
  • Robots.txt enforcement
  • Concurrency control
  • HTTP API for controlling crawlers
  • Interactive console, which allows you to create and debug spiders

Learn more

Want to know how the tools used in this blog can be used to create a machine learning program in Elixir? Don’t miss out on part two of this blog series, sign up to our newsletter and we’ll send the next blog straight to your inbox.


Which companies are using Elixir, and why? #MyTopdogStatus

How do you choose the right programming language for a project? As a developer or a key-decision maker you could decide based on your personal preference, or what is fashionable at the time, but if you’re not considering real experiences and use cases, you may end up making a decision that becomes difficult to execute, expensive to complete and a nightmare to maintain.

As Francesco mentioned in the Erlang blog of our #MyTopdogStatus series, you could start by asking why a language was created. This will help ensure that you’re choosing a language designed to fix a problem suited to your needs. On a number of occasions, José Valim has stated he wanted to give the power and productivity of Erlang to more developers.

Another great tip for choosing a language is to look under the hood and see what powers it. For that, the previously mentioned Erlang blog gives you a great insight into OTP and the Virtual Machine (VM) that help make Elixir the powerful language that it is.

Lastly, you can look at case studies to see how other companies have used the language, why they’ve chosen it, and what they were able to achieve. In practice, it’s not easy to find this information, as we still don’t know exactly how many companies use Elixir or their exact use case. Luckily, we can refer to www.elixir-companies.com or openly ask the community. And we hope this blog post will be helpful too!

Simply put, here is a checklist of common benefits that a company gets from using Elixir:

Need to grow or scale - If you have a service that you expect to be used by millions, then you’ll benefit from the reliability and scalability of the BEAM VM.

Handling a lot of incoming requests - Even if you don’t have millions of users, if all of your users make intense use of the system and generate a high volume of requests, you will need a system built for concurrency.

Easy to develop and maintainable - One of the priorities of Erlang and Elixir is keeping the code simple. As a result, one of the advantages most companies will see is ease of usage and fast development for everything from fixing bugs to adding new features.

Now, without further ado, let’s look at who’s using Elixir and how.

Event handling by Pinterest

Have you heard that Pinterest are using Elixir? Well, now you have. They use it to handle the routing of the events generated in their system. Think about all of the actions taking place on Pinterest in a day. How many do you think there are? The answer is around 30 thousand events per second that need to be processed. In addition, they have over 200 million active users.

Keep in mind, 20 years ago it would have been impressive to achieve 10 thousand concurrent connections. Luckily, the way the BEAM works lets us handle huge spikes in concurrent connections in a way that would be very difficult to replicate in other technologies.

Elixir is a good example of a language which takes advantage of the BEAM VM, which is why it was a great choice for Pinterest. Not only were they able to handle the demands of their users, but they were also able to significantly reduce the lines of code needed by simplifying and clustering it. As a result, they were able to halve the number of servers needed, down to just 15, allowing their solution to scale up with simple, maintainable code and lower operating costs. And it’s better for the environment too.

Faster distributed data by Moz Pro

Companies often look for alternatives to SQL for data storage. In the case of Moz Pro, one of the leading SEO analytics companies in the world, they decided to replace their traditional MySQL databases with a distributed indexing data system built in Elixir. In doing this, they managed to get responses of 50ms, significantly quicker than the previous +800ms.

Elixir’s ability to be distributed in different nodes while keeping transparent communication between the connected nodes makes it easy to create this kind of solution, improving service speed, no matter how much data you need to store.

The BEAM can handle millions of processes per node. That helps to create cache systems, as well as get rid of other external solutions like Redis or Memcached. In addition, the BEAM provides elements like DETS, ETS and Mnesia which help to create distributed data systems with ease.
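As a tiny sketch of the ETS part (the table and key names here are made up for illustration), an in-memory cache needs only a few calls:

```elixir
# Minimal ETS cache sketch; the table and keys are illustrative.
table = :ets.new(:cache, [:set, :public, read_concurrency: true])

# Store and retrieve a value without any external cache service.
:ets.insert(table, {:greeting, "hello"})

[{:greeting, value}] = :ets.lookup(table, :greeting)
```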

Proxying requests by Lonely Planet

This example shows the ability of Elixir and the Phoenix Framework to create a project that can handle massive spikes and loads from millions of users while managing the flow of requests to external sites like booking.com and HostelWorld.

Elixir is a great solution for creating systems to manage a high volume of requests to external resources. It allows you to create a pool of workers to access external APIs and keep back-pressure systems to avoid overwhelming your partner websites, which is an essential part of a successful system of this nature.
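As a rough sketch of that idea (the worker function and URLs below are made up; a production system would more likely use a dedicated pool or GenStage pipeline), Task.async_stream/3 with a max_concurrency limit gives a simple bounded worker pool in Elixir:

```elixir
# Illustrative back-pressure sketch: at most 4 "requests" run at once,
# no matter how many URLs are queued. simulate_call/1 stands in for a
# real HTTP call to a partner API.
simulate_call = fn url -> {url, :ok} end

urls = for n <- 1..10, do: "https://partner.example/api/#{n}"

results =
  urls
  |> Task.async_stream(simulate_call, max_concurrency: 4, timeout: 5_000)
  |> Enum.map(fn {:ok, res} -> res end)
```

The max_concurrency option is what prevents a burst of incoming traffic from being forwarded, unthrottled, to the partner site.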

In addition, the Phoenix Framework is a very complete system for building extremely complex websites. It makes extensibility and documentation easy, and helps you avoid boilerplate by minimising the lines of code needed. This allows you to focus on the implementation of integrations with third-party platforms.

GraphQL by Financial Times

The Financial Times made fantastic use of the Absinthe framework in Elixir, and of Elixir’s meta-programming ability to create DSLs (Domain Specific Languages). Ellis Pritchard, the Senior Software Developer at the time, said:

“I found developers picked up Elixir basics very rapidly if there is an existing framework of code to work within at first since there are patterns they can spot and replicate. Good developers love learning new things, and are adept at switching between languages and frameworks; plus the learning curve for Elixir, certainly within an existing code base, seems low”.

They use Elixir for microservices in their system. You can check their GitHub account for more details and take advantage of the elements they share which are battle-tested in production.

Mobility Services by Toyota Connected

Toyota Connected launched its Mobility Service Platform (MSPF) which connects their cars, allowing them to send real-time events. Functional uses for this include sharing information on traffic patterns or driver behaviours.

It is a platform that is expected to have millions of connected vehicles sending events to the cloud regularly. This kind of concurrent traffic makes Elixir an obvious choice for the project. In addition, Toyota Connected features an API that third-parties can use to create apps and website integrations. This allows third parties to take advantage of the data and create a better user experience for their drivers.

A great example of this is the Hui service in Hawaii, which uses the Toyota Connected system to run a keyless car-share service.

Custom Sports News by Bleacher Report

Bleacher Report is a sports news service that serves 1.5 billion pages per month to its users. They offer a customised service to ensure push notifications are relevant to each individual. That in itself is difficult because it means you cannot cache the content between users.

Much like Lonely Planet, Bleacher Report also has to consider how they send requests to third-parties (mainly Google and Apple) to reach their users without limiting the throughput or overwhelming their providers.

When transitioning from Ruby on Rails to Elixir, they were able to shift from using 150 servers down to just 5, all while improving response times for their 200m+ daily push notifications.

Microservices for Games by SquareEnix

The increase in massively multiplayer online gaming has seen a huge shift in the way games are developed. Now, digital environments must consider how they can handle millions of concurrent players at once. To handle this, SquareEnix is using Elixir for authenticating players, in-game communication and the CMS (Content Management System).

These are services which are common to several titles because they rely on the same company and are usually a critical point of convergence for the traffic of the system. Elixir helps provide a healthy throughput, dealing with all of the input requests and releasing the session, authentication and profile data from other services without collapsing the system.

Chat for gamers by Discord

Another huge use for Elixir in gaming is chat systems. Discord, a leading provider of chat for gamers, has built its systems in Elixir. Their app can allow huge groups (up to 300 thousand in theory) to join the same voice call. You can read about how they upgraded to be able to handle 11 million concurrent users here.

ECommerce by PepsiCo

The ECommerce team at PepsiCo told ElixirConf US 2019 “We view Elixir – which is a core part of our software stack – as a nimble and reliable building block for powering critical business solutions.” It is fantastic to see companies of PepsiCo’s size proudly speaking about the benefits of Elixir. We hope to add many more companies of their size to this list over the coming years.

Is it time for you to join the pack?

Leading companies in a variety of industries are already taking advantage of Elixir. They’re doing it to reduce the cost of their physical hardware, to improve the reliability of their sites when dealing with high volumes of concurrent users, to scale up data while reducing the time needed to access it, to allow for fault-tolerant third-party integration even when dealing with high volumes of traffic, and to create reliable microservices, even when dealing with extremely high numbers of requests.

Even if the above requirements aren’t needed for your current project, you can still benefit from the future-proofing, ease of use, available documentation and tooling that Elixir offers.

Add your company to the blog

If you’re part of the pack already using Elixir and want to see your company featured in this blog, fill out this form.


Want to see more great Elixir and Erlang success stories? Head to our hub where we’ll be adding videos, case studies, blogs and more.


Erlang Innovation Race: The Winners!

On Saturday 10 million people turned on their tellies to watch the Grand National race, and we wanted to celebrate this beloved British tradition in our own way. Alas, we know little about horses and racing, but we take pride in knowing a lot about technology. We picked some pretty cool Erlang companies and over the weekend we allowed everybody to push them ahead in the race with one click. Rumor has it that our office manager played the bookie and ran the bets in the office, but we can neither confirm nor deny these allegations :P Before we review the winner, take a look at the companies in the race and their use of Erlang:

Klarna: One of Europe’s leading payment solutions providers. Klarna was valued at $2.25 billion before their U.S. launch in 2015. They use Erlang for concurrent, realtime domains, along with Riak for storing data where availability trumps consistency.

Grindr: The world’s largest gay social network, recently announced they geared up their scalable platform to expand from app into lifestyle service. They use Erlang in their core platform and MongooseIM in their chat.

Fermilab: There are very few things cooler or more innovative than a particle accelerator. The USA’s particle accelerator lab uses Erlang extensively in its control systems.

Ericsson: The birthplace of Erlang. Ericsson now supports and maintains it through their OTP unit. The company turned 140 years old last week, and their innovation accomplishments speak for themselves.

Machine Zone: The mobile gaming phenomenon announced yesterday the launch of RTplatform, a cloud platform that “connects and engages people-devices-data simultaneously in real time…at unparalleled speed, capacity and affordability.” They use Erlang extensively, including in their Game of War chat app.

Sqor Sports: This startup uses Erlang for their backend application and MongooseIM in their chat app. They set out to revolutionise how sports fans interact with athletes, and after taking over the US they are now expanding to Europe, starting with a Bayern Munich FC partnership.

bet365: The UK’s leading online betting and gambling company switched to Erlang 3 years ago and never looked back. They have become such fierce Erlang advocates that they recently open-sourced some of their Erlang tools. (Thank you bet365!)

Aol: Their leading real-time ad exchange platform, Marketplace, revolutionised the RTB market in 2013. The platform is a merger of Erlang, Java and C++ and serves up to 40 billion auction requests per day.

Whatsapp: The poster child of Erlang companies serves over 1 billion users with a team of 50 engineers. They recently switched to end-to-end encryption, possibly the world’s largest-ever implementation of this standard of encryption in a messaging service.

EE: As of 12 February 2016, EE’s 4G network reaches more than 95% of the UK population, with double speed 4G reaching 80%. Erlang/OTP has been in use at EE for the past 15 years.

Why an “Innovation Race”? The importance a company gives to innovation is a sure marker of its growth and success, and the world’s most successful companies are the most innovative ones. What drives innovation? According to the Boston Consulting Group, there are 4 interrelated factors: quickly adopting new technologies; leveraging existing technology platforms to expand into new sectors; lean R&D processes; and adopting technology across the entire business instead of keeping it only in the IT department. Erlang is not exactly a new technology, but it seems that more and more companies are just discovering it. Initially developed for telecoms, Erlang is finding more and more usage in new sectors such as those above, and there are excellent use cases for Big Data and the Internet of Things.

For many companies, getting past a certain number of users means that the ability to innovate quickly is lost. Precious time is spent constantly rewriting software so it runs at the scale needed to run the business, and getting a new innovation to scale often means that they can’t get it out of the door fast enough. Here’s where Erlang comes in handy: its support for millions of lightweight processes and its functional paradigm allow companies to build massively concurrent and fault-tolerant applications quickly, with a profound reduction in the complexity of the code base.

At the end of the day, for any company, innovation leadership translates into profit. Two of the world’s hottest startups on the ‘Unicorn Valuation’ list know a thing or two about Erlang: 10% of the 230 billion dollars generated by the top 20 unicorns came from Whatsapp and MachineZone. In addition, many of the 20 companies on the list that are not using Erlang themselves use a component in their stack written in Erlang, like RabbitMQ, Ejabberd or Riak.

And now, here are the winners of our Erlang Innovation Race:

1st place Whatsapp: 36.51% of votes

2nd place Ericsson: 22.22%

3rd place bet365: 12.70%


We can’t wait to see who will run the Erlang race next year!



Clever use of persistent_term

This blog post will go through three different uses of persistent_term that I have found since its release, and explain a bit about why they work so well with persistent_term.

Global counters

Let’s say you want to have some global counters in your system; for example, the number of times an HTTP request has been made. If the system is very busy, that counter will be incremented many times per second by many different processes. Before OTP-22, the best way that I know of to get the best performance is by using striped ETS tables, i.e. something like the code below:

incr(Counter) ->
    %% Each scheduler updates its own slot to avoid lock contention.
    %% Assumes the table and the per-scheduler slots were created beforehand.
    ets:update_counter(?MODULE, {Counter, erlang:system_info(scheduler_id)}, 1).

read(Counter) ->
    %% Sum all the per-scheduler slots to get the total.
    lists:sum(ets:select(?MODULE, [{{{Counter, '_'}, '$1'}, [], ['$1']}])).

The code above makes sure that there is very little contention on the ETS table, as each scheduler gets a separate slot in the table to update. This comes at the cost of more memory usage, and of the fact that when reading the value you may not get an exact result.

In OTP-22 the same can be achieved by using counters. Counters have built-in support for striping by using the write_concurrency option, so we don’t have to write our own implementation for that. They are also faster and use less memory than ets tables, so lots of wins.
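In Elixir syntax, a minimal sketch of the counters API looks like this (a single counter at index 1; note that with the :write_concurrency option, reads from other processes may briefly lag behind concurrent writes):

```elixir
# counters: a fixed-size array of 64-bit integers with built-in striping.
ref = :counters.new(1, [:write_concurrency])

:counters.add(ref, 1, 1)
:counters.add(ref, 1, 41)

# Reading from the same process sees our own writes.
value = :counters.get(ref, 1)
```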

The remaining problem then is finding the reference to the counter. We could put it into ets and then do an ets:lookup_element/3 when updating a counter.

cnt_incr(Counter) ->
    %% The counter reference is stored in ets as {Counter, Ref}.
    counters:add(ets:lookup_element(?MODULE, Counter, 2), 1, 1).

cnt_read(Counter) ->
    counters:get(ets:lookup_element(?MODULE, Counter, 2), 1).

This gives a performance degradation of about 20%, so not really what we want. However, if we place the counter in persistent_term, as in the code below, we get a performance increase of about 140%, which is much more in line with what we wanted.

cnt_pt_incr(Counter) ->
    counters:add(persistent_term:get({?MODULE, Counter}), 1, 1).

cnt_pt_read(Counter) ->
    counters:get(persistent_term:get({?MODULE, Counter}), 1).

The reason for this huge difference is that when the counters are placed into persistent_term, they are stored as literals, which means that at each increment we no longer have to make a copy of the counter’s reference. This is good for two reasons:

1) The amount of garbage will decrease. In my benchmarks the amount of garbage generated by cnt_incr is 6 words while both ets_incr and cnt_pt_incr create 3 words.

2) No reference counts have to be modified. What I mean by this is that the counter’s reference is what is called a magic reference or NIF resource. These references work much in the same way as reference-counted binaries, in that they are not copied when sent to different processes. Instead, only a reference count is incremented at copy time, and then decremented later by the GC. This means that for cnt_incr we actually have 3 counters that are modified for each call: first we increment the reference count on the counter when copying from ETS, then we update the actual counter, and then eventually we decrement the reference count. If we use persistent_term, the term is never copied, so we don’t have to update any reference counts; instead we just have to update the actual counter.

However, placing the counter in persistent_term is not trouble-free. In order to delete or replace the counter reference in persistent_term, we have to do a global GC, which, depending on the system, could be very expensive.

So this method is best used only for global persistent counters that will never be deleted.

You can find the code for all the above examples and the benchmark I ran here.

Logger level check

In logger there is a primary logging level that is the first test to be done for each potential log message. This check can be done many times per second and needs to be very quick. At the time of writing (OTP-22), logger uses an ets table to keep all of its configuration, which includes the primary logging level.

This is not really ideal as doing a lookup from the ets table means that we have to take a read-lock to protect against parallel writes to the value. Taking such a read lock is not terribly expensive, but when done thousands of times per second it adds up.

So in this PR I’ve used persistent_term as a cache for the primary logging level. Now when reading the value from the hot path logger will instead use persistent_term. This removes all locks from the hot path and we only need to do a lookup in the persistent_term hash table.

But what if we need to update the primary logger level? Don’t we force a global GC then? No, because the small integer representing the primary logger level is an immediate. This means that the value fits in one machine word and is always copied in its entirety to the calling process. Which in turn means that we don’t have to do a global GC when replacing the value.
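A small sketch of that idea in Elixir syntax (the key name here is made up): because a small integer is an immediate, replacing it in persistent_term is cheap and triggers no global GC:

```elixir
# Cache a logging level (a small integer, i.e. an immediate).
:persistent_term.put({:my_app, :log_level}, 3)

level = :persistent_term.get({:my_app, :log_level})

# Replacing an immediate with another immediate is cheap:
# no process can hold a heap reference into the old value,
# so no global GC is required.
:persistent_term.put({:my_app, :log_level}, 4)
```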

When doing this we have to be very careful so that the value does not become a heap value as the cost of doing an update would explode. However, it works great for logger and has reduced the overhead of a ?LOG_INFO call by about 65% when no logging should be done.

Large constant data

We use an internal tool here at the OTP team called the “ticket tool”. It basically manages all of the OTP-XYZ tickets that you see in the release notes that come with each release of Erlang/OTP. It is an ancient tool from the late ’90s or early ’00s that no one really wants to touch.

One part of it is a server that contains a cache of all the 17000 or so tickets that have been created through the years. In that server there is a single process that has each ticket and its state in order to speed up searching in the tickets. The state of this process is quite large and when it is doing a GC it takes somewhere around 10 seconds for it to finish. This means that about every 10 minutes the server freezes for 10 seconds and we get to experience the joy of being Java programmers for a while.

Being a VM developer I’ve always thought the solution to this problem is to implement either an incremental GC or at least a mark and sweep GC for large heaps. However, the ticket tool server has never been of high enough priority to make me spend a year or two rewriting the GC.

So, two weeks ago I decided to take a look, and instead I used persistent_term to move the data from the heap into the literal area. This was possible because I know that the majority of tickets are only searched and never changed, so they will remain in the literal area forever, while the tickets that do get edited move onto the heap of the ticket server. Basically, my code change was this:

handle_info(timeout, State) ->
  erlang:start_timer(60 * 60 * 1000, self(), timeout),
  %% Assumption: the elided lines store the state as a persistent term
  %% and keep using the stored copy, which now lives in the literal area.
  persistent_term:put(?MODULE, State),
  {noreply, persistent_term:get(?MODULE)}.

This small change puts the entire gen_server state into the literal area and then any changes done to it will pull the data into the heap. This dropped the GC pauses down to being non-noticeable and took considerably less time to implement than a new GC algorithm.


Personal notes from Elixir Conf 2019

Last week I attended Elixir Conf 2019 in Aurora, Colorado. It was my fourth Elixir Conf and by far the one where I engaged the most with people I had never talked to before; the hallway track was great and I could meet again some conference buddies from other editions. Another highlight was the conference hotel, a brand-new resort close to Denver. Whoever knows me well understands how I like anything Colorado, so I am glad the next edition will be in the same location.

The conference format was similar to previous years: three tracks over two days, with two extra days for optional training, which I didn’t attend this time. The conference had four keynotes, by José Valim (Elixir creator), Chris McCord (Phoenix creator), Justin Schneck (Nerves co-creator) and the Dockyard team introducing Lumen, plus great talks.

All the talks and keynotes are available in the Elixir Conf YouTube channel. Each keynote focused on a few areas: the Elixir language and its future, LiveView, Nerves, and the brand-new Lumen, a project from Dockyard that runs Elixir in the client (browser).

I always like to take notes when attending conferences, and these are my highlights for Elixir Conf 2019. Please be advised that these notes are written like reminders of things I considered relevant during the talks; they are not a summary of them by any means. As the conference had many tracks, of course I couldn’t attend all the talks, so feel free to comment with your favorite talk and notes:

Keynote: José Valim

Before anything, thank you very much José for the incredible language; it is a pleasure working full-time with Elixir and sharing the same community with great people, and I extend that to all the Elixir core team as well.

Goal for 2019: streamline Elixir in production

The main goal is to make it easier to put applications in production, and to ensure that what is running in production can be inspected.

The two main pieces are releases and telemetry.


  • Elixir 1.9 introduced release commands as part of the language, with lots of lessons and concepts from Distillery;
  • shellcheck to test shell scripts;


The telemetry library is composed of many pieces and helps snapshot what is going on inside running applications through metrics from the BEAM and custom ones.

  • telemetry, telemetry_poller, telemetry_metrics, and telemetry reporters;
  • Phoenix 1.5 will come with a Telemetry module to help define application specific metrics;
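
As a small, hedged illustration of how these pieces fit together (the event name, handler id and numbers below are invented), attaching a handler and emitting an event with the core telemetry library looks roughly like this:

```erlang
%% Sketch: attach a handler, then emit an event from instrumented code.
%% Requires the `telemetry` library as a dependency.
ok = telemetry:attach(
       <<"log-query-time">>,                 %% unique handler id
       [myapp, repo, query],                 %% event name
       fun(_EventName, #{duration := D}, _Metadata, _Config) ->
           io:format("query took ~p~n", [D])
       end,
       undefined),

%% Somewhere in the instrumented code:
ok = telemetry:execute([myapp, repo, query],
                       #{duration => 1500},
                       #{source => <<"users">>}).
```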

What is next for Elixir?

The next release, 1.10, will come with a different setup for its CI, now using Cirrus CI. It will also contain compiler tracing and ExUnit pattern diffing.

What is next for José Valim?

Now that the language enhancements are reaching a very stable point, and community projects are being built and showcasing the power of the language, efforts will be directed more to lower-level work (HiPE, Lumen and Hastega) and code analysis.

Keynote: Chris McCord

At Lonestar Elixir Conf 2019, Chris McCord presented the then-private project LiveView, which enables rich, real-time user experiences with server-rendered HTML. Right after the conference, the project was made available to the public and since then a collection of projects has been showcasing how interesting and powerful LiveView is. That includes Phoenix Phrenzy, a contest for LiveView projects.

In this keynote, he presented a few interesting points of LiveView and also what is coming next and the reasons behind it.

LiveView library

  • template engine;
  • tests;
  • navigation: live_redirect, live_link and handle_params for pagination URL changes;
  • prepend, append, ignore and replace updates (phx-update);
  • JS hooks for when you need just a piece of extra JavaScript;
  • IE11 support;

Coming soon

  • phx-debounce;
  • file uploads;
  • exception translation;
  • stash for client state;
  • automatic form recovery;

Keynote: Justin Schneck

In his keynote, Justin Schneck, again, gave a really nice demo to show the improvements Nerves is getting.


  • operating system: Erlang VM and Linux kernel;
  • application domain: Nerves runtime allows other than Elixir, and resilience from Erlang VM;
  • application data for specific data storage;
  • read-only filesystem, immutable;
  • whitelist approach (build up) with minimal output;

Nerves Hub

  • CryptoAuthentication chip;
  • NervesKey: write once;
  • delegated auth;
  • certificate;
  • remote console;
  • Terraform scripts to host your own Nerves Hub;

Keynote: Brian Cardarella, Paul Schoenfelder and Luke Imhoff

Introducing Lumen

Lumen is a new project from Dockyard that uses Elixir in the client. It uses WebAssembly and Rust as well in the compiler.


The compiler has some unique constraints such as code size, load time and concurrency model.

Why not BEAM?

  • Runtime: unavailable APIs, incompatible scheduler, JS managed types;
  • Code size: BEAM bytecode is expensive, weak dead-code elimination;
  • Performance: VM on a VM, JS engine unable to reason about bytecode;

New Compiler

  • Restrictions: no hot-code, allow dead-code elimination;
  • Ahead of time vs VM: only pay for what you use, no interpretation overhead;
  • Build on existing tools: LLVM, Rust, wasm-bindgen;
  • Key challenges: recursion and tail-call optimization, non-local returns, green threads, WebAssembly-specific limitations;
  • Continuations: represent jumps as calls to a continuation, never return, always moving forward;

Accepts source in Erlang and Core Erlang, including a mix task.

Middle tier
AST lowered to EIR.

Lowers from EIR to LLVM IR and generates object files.

Future goals
Push more data to LLVM, bootstrapping, MLIR, and embedded targets.


Memory management
BEAM and Rust, with property-based testing.

Memory model
Process heap and per-process garbage collection.

Very similar to what we have in Elixir: call stack, mailbox, pid, memory, links, and so on.

One scheduler per thread: the main thread and web workers.

WebAssembly main thread
Calls are blocking, scheduler needs to run timers.

Interacting with web
JS calling Lumen, and Lumen calling JS.

Why Lumen?

  • Joe's paper about GUI in functional programming;
  • Optimize front-end development;
  • GUI is concurrent by nature, we could have a supervisor taking care
    of a DOM tree;

Phoenix LiveView Demystified: Alex Garibay

In this talk Alex showed some LiveView internals, how it works and why it works so well.

LiveView EEx

From template with sigils to AST

  • static/vars and dynamic code;
  • %Phoenix.LiveView.Rendered{} with static, dynamic and fingerprint;
  • %Phoenix.LiveView.Comprehension{} to optimize data sent to the client;

Mounting the DOM

  • router, controller or template with live and live_render macros;
  • the rendered template has a few ids for channels and sessions;
  • the container <div> that receives the template can be configured to be any HTML tag;

Phoenix Channels

  • uses Phoenix.LiveView.Socket;
  • uses a specific channel "lv:*";
  • socket receives session data and potentially user data;
  • client triggers an event that is sent to the socket using %Phoenix.Socket.Message{};
  • channel handles the event with callbacks;


  • import LiveView library;
  • instantiate and connect LiveSocket;
  • a JS view is created with LiveView container, socket and so on;
  • each JS object in the view has static and dynamic data that is constructed for every change;
  • uses Morphdom.js to re-render the changes in the DOM;

WebRTC from start to finish: Scott Hamilton

WebRTC (Web Real-Time Communication) is a free, open-source project that provides web browsers and mobile applications with real-time communication (RTC) via simple application programming interfaces (APIs). It allows audio and video communication to work inside web pages by allowing direct peer-to-peer communication, eliminating the need to install plugins or download native apps.

Janus is a general purpose WebRTC server that has an Elixir client available.


  • spec and project;
  • basic implementation in P2P;
  • terminology: STUN, TURN and SDP;


  • general purpose WebRTC gateway;
  • JS, websocket;
  • 101: session, plugin, peer connection, handle;
  • resources to learn it are easy to find;

Elixir Phoenix Janus

Phoenix as middleman between client and Janus.

What could go wrong?

  • Janus configuration: ICE trickle;
  • Janus on Docker;
  • deployment;
  • translation from Janus response to Elixir terms;
  • mixing HTTP and WS calls;

Elixir + CQRS - Architecting for availability, operability, and maintainability at PagerDuty: Jon Grieman

PagerDuty has a feature that records all incident user logs, and they use the CQRS pattern to design that piece of functionality.

They use composite keys for tracing that can be ordered. Other benefits of the approach include separation for monitoring and scaling.

Overall incident log system

  • upstream systems;
  • Kafka;
  • creator;
  • DB cluster;
  • querier;
  • client systems;

DB incident recovery

  • whole stack that could be duplicated in region;
  • replace the DB engine entirely after back to normal;
  • operational benefits coming from ES and CQRS;

Date, Time, and Time Zones in Elixir 1.9: Lau Taarnskov

Handling date and time is a challenge in any language; in this talk we see the details behind Calendar in Elixir, which had an upgrade in version 1.9.

Differences between UTC, TAI and the concept of layers of wall time.

Elixir Calendar, Date and Time specifics

  • sigils: ~D, ~T, ~N, ~U;
  • choose the type by the meaning, not convenience;
  • NaiveDateTime vs DateTime date/time comparison;
    - no data is better than fake/assumed data (JS and Ruby, for example);
    - correct data is better than assumptions;
  • TimeZoneDatabase behaviour, for example tzdata;
  • check date time stored to verify they are appending Z;

Mint, disrupting HTTP clients: Andrea Leopardi

Mint is a reasonably new low-level HTTP client that aims to provide a small and functional core that others can build on top of.

Story and features

The initial need was caused by a potential removal of httpc from the Erlang standard library. Mint is a processless client that defines a wrapper data structure for the connection, on top of gen_tcp. Mint knows how to handle raw bits and also the HTTP protocol.

Streaming by default
Responses arrive asynchronously (status, headers, body and so on).


  • httpc is not safe by default;
  • hackney is safe by default but can be overridden if you work with `ssl_options`;
  • mint is safe by default;
  • SSL certificate store with castore;

Mint has proxying support for request and tunnel proxy types.


  • multiplexed streams;
  • server push;
  • backpressure;

Use cases

GenServer, GenStage, GenStatem with connection data. Also, it can be used as one process with many connections.

Challenges and future plans

  • immutability;
  • low level but usable;
  • increase adoption;
  • pooling;
  • websockets;
  • gRPC;

BEAM extreme: Miriam Pena

In her talk, Miriam showed us things to consider to improve performance, and also warned us that any tuning of the VM should be done only when needed, for very specific cases. Besides that, performance measuring should be done over a long time, and not using IEx.

One of the things she mentioned is that memory copying, something that happens a lot in the BEAM, brings CPU overhead.

Pool bottleneck

  • use processes directly instead of GenServer, 80% faster;
  • leverage process priority levels;
  • as a con, it hurts code readability;

Key-value storage

  • process dictionary: hard to debug, tuple performance as an example;
  • code generation;
  • persistent term: access in constant time, write once read many, no copying to the process heap, tuple performance as an example;

NIFs


  • for when all the rest fails;
  • extreme memory usage;
  • no VM guarantees;

Another suggestion is to keep the OTP version as up to date as possible, as new releases are always improving performance.

Contracts for building robust systems: Chris Keathley

In this talk, Chris presents some insights into why Design by Contract should be considered and how his new library Norm can help projects with data specification and generation.

Breaking changes require coordination

  • refactor;
  • requirement (requires no breaking changes);
  • technology;


  • enforcing callers are correct (pre-conditions);
  • ensure the function is correct (post-conditions);
  • ExContract;

Data specification

  • Norm;
  • using in combination with ExContract;
  • supports schema, optionality;

Testing strategy

  • Norm uses stream data to generate data for testing;
  • can be used with property-based test;

Writing an Ecto adapter, introducing MyXQL: Wojtek Mach

Even though I'm not personally interested in MySQL, as I prefer and use only Postgres, I wanted to know more about the internals of Ecto and how it uses its adapters to connect to and interact with databases.


  • :gen_tcp for database connection;
  • library binpp;
  • Wireshark app;

MySQL packet

  • payload length;
  • sequence id;
  • packet payload;
  • leverages Elixir binary pattern matching;
  • initial handshake package;
  • handshake response;
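
The bullets above lend themselves to a short sketch: every MySQL packet starts with a 3-byte little-endian payload length and a 1-byte sequence id, which maps directly onto a binary pattern (module and function names below are mine, not MyXQL’s):

```erlang
%% Sketch of decoding one MySQL packet with binary pattern matching.
-module(mysql_packet_sketch).
-export([decode/1]).

decode(<<Len:24/little, SeqId:8, Payload:Len/binary, Rest/binary>>) ->
    {ok, #{length => Len, seq_id => SeqId, payload => Payload}, Rest};
decode(Bin) when is_binary(Bin) ->
    {more, Bin}. %% not enough bytes for a full packet yet
```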

Building the adapter

  • encoding and decoding: function for every data type, OK_Packet with defrecord;
  • DBConnection behaviour: maintain a connection pool, does not overload the database, reconnect, support to common database features, and it needs to be fast;


  • start N connections using DBConnection based on the pool configuration;
  • fetching results preparing and executing the query;
  • other functions as disconnect, checkout, ping, and so on;

Ecto integration

  • all the features Postgres adapter has;
  • implements Ecto.Adapter, Ecto.Adapter.Queryable, Ecto.Adapter.Schema, Ecto.Adapter.Storage, and Ecto.Adapter.Transaction;
  • constraints;
  • data normalization: boolean as an example, as in MySQL it is set as 1 and 0;
  • integration tests;

Kubernetes at small scale: Phil Toland

Kubernetes has great benefits, even if it is not so easy to implement. Some of the benefits are improved resource efficiency, reduced cost, and operational scalability. In this talk Phil described his process of implementing Kubernetes at Hippware.

Main components

  • control plane (leave it alone);
  • workers;
  • workload:
    - pod;
    - deployment (how your pod runs in the cluster);
    - service (expose a pod to outside, expose a port);
    - config map;

Lessons learned

  • outsource as much as possible;
  • automate all the things;
  • pods are ephemeral;
  • automatic clustering: via libraries such as libcluster and peerage;
  • deploying: via libraries such as kubernetes-deploy and swarm;
  • one instance per node: anti-affinity specification;

ETS Versus ElasticSearch for Queryable Caching: David Schainks

In this talk David compares the characteristics of ElasticSearch, which is well-known as a great solution for search and cached data, with Elixir/Erlang out-of-the-box solutions such as ETS and Persistent Term, listing the pros and cons of each option.

ElasticSearch


  • filtering, auto completion and full text search;
  • performant;
  • queryable cache;
  • operational cost: another DSL, integration time, expensive, and configuration gotchas;

ETS


  • no garbage collection;
  • use cases: preset configuration, cached data, message buffer;
  • filtering with match/2, match_object/1 and fun2ms/1;
  • auto completion with fun2ms/1;
  • full text search: using Joe Armstrong's elib1 library;
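
To make the filtering bullets concrete, here is a small hedged sketch (the table and its contents are made up) of querying ETS with a match spec built by ets:fun2ms/1:

```erlang
%% Sketch: filtering an ETS table with a compiled match spec.
%% The ms_transform include is required for ets:fun2ms/1 to work.
-module(ets_filter_sketch).
-include_lib("stdlib/include/ms_transform.hrl").
-export([demo/0]).

demo() ->
    Tab = ets:new(people, [set]),
    true = ets:insert(Tab, [{1, "alice", 34}, {2, "bob", 29}]),
    %% Names of everyone younger than 30:
    MS = ets:fun2ms(fun({_Id, Name, Age}) when Age < 30 -> Name end),
    ["bob"] = ets:select(Tab, MS),
    ok.
```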

Real world concerns

  • performance with large data sets;
  • data ranking;

Operational concerns

  • high availability;
  • synchronization;
  • index changes;

Persistent term

  • no filtering;
  • much faster than ETS;
  • garbage collection on update;

UI Testing is Ruff; Hound Can Help: Vanessa Lee

Whether you call it UI testing, end-to-end testing, end-to-user testing, or acceptance testing, it is often an intensely manual and time-consuming process. An Elixir library, Hound, can carry some of the load through browser automation.

Hound takes screenshots of tests and stores them in the test folder.

Library test helpers

  • check page elements;
  • fill, execute actions and submit form;
  • manage sessions;

Property-based testing

  • combining Hound with property-based libraries is very straightforward;
  • using a custom StreamData generator with the bind/2 helper;

While handy, there are some gotchas when elements are not ready or if the application is heavily dependent on JavaScript events/side effects.


How to Comprehend Comprehensions

Particularly for Erlang

Good Will Hunting (1997)

So, I Gusti Ngurah Oka Prinarjaya was reading Joe’s Book and he found one of the most amazing examples of List Comprehensions I’ve ever seen…

perms([]) -> [[]];
perms(List) -> [ [H|T] || H <- List, T <- perms(List--[H]) ].


1> lib_misc:perms("123").
["123","132","213","231","312","321"]

And, of course… he couldn’t understand it. And, as a seasoned Erlang trainer, I’ve got to tell you: he’s not alone… by any means. After a session of the BeamBA Meetup where I casually tried to explain something even simpler than this, we had to schedule another meetup talk just to explain how recursion works and how to think recursively.

So, let me try to explain how recursion and list comprehensions work together in perms/1, how one can end up with a function like that, and how to read it. Let’s see how far I can get in a blog post… and Robert Virding (I know you’re a fan of perms/1) please point out the mistakes I’ll surely make ;)

How to write a function like that?

So, let’s first think of how we would come up with such a beautiful function.

The requirement here is pretty simple:

The function perms/1 should receive a single argument (a list) and return the list of all possible permutations of it.

To understand what it does, let’s write a test for it, shall we? (I’ll use strings, since that’s what Joe used in his book, but any list will do)

test() ->
  [[]] = perms(""),
  ["a"] = perms("a"),
  ["ab", "ba"] = perms("ab"),
  ["abc", "acb", "bac", "bca", "cab", "cba"] = perms("abc").

Cool, if we try to run that test it will tell us that we need to implement the function…

1> c(joe), joe:test().
** exception error: undefined function joe:perms/1
in function joe:test/0 (joe.erl, line 6)

We can start with the base case then… The only permutation of an empty list is itself.

perms([]) -> [[]].

That moved us one step ahead. Cool!

2> c(joe), joe:test().
** exception error: no function clause matching joe:perms("a") (joe.erl, line 13)
in function joe:test/0 (joe.erl, line 7)

Now, for the recursive step…

When writing recursive functions, we must assume that we know how to apply the function to a smaller input. In our case, we’re working with lists, so a smaller input can be the tail of the list (since it’s a smaller list). That’s why my first step when writing this code would be something like this…

perms([H|T]) -> … perms(T) ….

In other words, I take the head of the list (H) and apply the function recursively to its tail (T). Now we have to figure out how to move from the list of permutations for T to the list of permutations for [H|T] .

So, let’s say H=a, T=[b,c], then perms(T) == [[b,c], [c,b]] and we want to get to [[a,b,c], [a,c,b], [b,a,c], [c,a,b], [b,c,a], [c,b,a]]. 🤔

The first two are easy to build: [ [H|P] || P <- perms(T) ]. So far, so good. Let’s try it and see what the tests have to say about it.

3> c(joe), joe:test().
** exception error: no match of right hand side value ["ab"]
in function joe:test/0 (joe.erl, line 8)

Right… perms("ab") is not equal to ["ab", "ba"] , it’s just ["ab"]. We’re just adding the permutations that have the elements in order, since we’re traversing the list from left to right. We need a different way to reduce our list.

So far we had perms(T) be the list of permutations of the elements in the tail of the list. What if we had access to the list of all permutations of any sublist of the original list ([H|T]) with one element less (i.e. what if for [a,b,c] we could build the list of all permutations for [a,b], the list of all permutations for [b,c], and the one for [a,c]). In that case, to build the final result we will only need to add the missing element to their heads.

Too complex? Well… let’s go step by step… First let’s build all the sublists…

perms(List) -> [List -- [H] || H <- List].

Let’s test that on the shell for a second…

4> c(joe), joe:perms("ab").
["b","a"]
5> c(joe), joe:perms("abc").
["bc","ac","ab"]
6> c(joe), joe:perms("abcd").
["bcd","acd","abd","abc"]

Nice! So… the following code won’t work, but if we could compute the permutations for each of those sublists like this…

perms(List) -> [perms(List -- [H]) || H <- List].

…we would end up with something like…

4> c(joe), joe:perms("ab").
[["b"],["a"]]
5> c(joe), joe:perms("abc").
[["bc","cb"], ["ac","ca"], ["ab","ba"]]

Let me first apply a neat trick here to flatten the list. Something we can easily do in List Comprehensions by just using multiple generators.

perms(List) -> [T || H <- List, T <- perms(List -- [H])].

Again, using a built-up example since the code won’t actually work like this…

4> c(joe), joe:perms("ab").
["b","a"]
5> c(joe), joe:perms("abc").
["bc","cb", "ac","ca", "ab","ba"]

We’re really close there. To get from ["b","a"] to the actual result we want (["ab", "ba"]) we just need to prepend each list with the element we originally removed (which is H in our code). Let’s try that!

perms([]) -> [[]];
perms(List) -> [ [H|T] || H <- List, T <- perms(List--[H]) ].

7> c(joe), joe:test().

Well… that’s Joe’s code, isn’t it? What did you expect? 😉

How to read a function like that?

Well, that was fun, but what if you are faced with a function like that and your job is to understand it, not to write it? Well, a neat technique you can use is to compute the reductions manually or with the debugger. I’ll do it manually here.

What we’ll be doing is basically emulating the VM’s work by hand, one step at a time. So, let’s start with a sample expression and let’s see what the VM would do. To be clear: these are not the actual reductions in the VM. I took some liberties, mostly because LCs are just syntactic sugar. But I hope you understand what I’m showing here…

% Second clause of perms/1
[ [H|T] || H <- "123", T <- joe:perms("123" -- [H])].
% Expansion of the first generator
[ [$1|T] || T <- joe:perms("123" -- "1") ] ++
[ [$2|T] || T <- joe:perms("123" -- "2") ] ++
[ [$3|T] || T <- joe:perms("123" -- "3") ].
% Reduction of --
[ [$1|T] || T <- joe:perms("23") ] ++
[ [$2|T] || T <- joe:perms("13") ] ++
[ [$3|T] || T <- joe:perms("12") ].
% Multiple second clauses of perms/1
[ [$1|T]
|| T <- [ [H|T1] || H <- "23", T1 <- joe:perms("23" -- [H])]] ++
[ [$2|T]
|| T <- [ [H|T1] || H <- "13", T1 <- joe:perms("13" -- [H])] ] ++
[ [$3|T]
|| T <- [ [H|T1] || H <- "12", T1 <- joe:perms("12" -- [H])] ].
% Multiple expansion of first generators and --
[ [$1|T]
|| T <- [ [$2|T1] || T1 <- joe:perms("3")] ++
[ [$3|T1] || T1 <- joe:perms("2")] ] ++
[ [$2|T]
|| T <- [ [$1|T1] || T1 <- joe:perms("3")] ++
[ [$3|T1] || T1 <- joe:perms("1")] ] ++
[ [$3|T]
|| T <- [ [$1|T1] || T1 <- joe:perms("2")] ++
[ [$2|T1] || T1 <- joe:perms("1")] ].

Let’s work on perms(“3”) alone for a second…

% Second clause of perms/1
[ [H|T] || H <- "3", T <- joe:perms("3" -- [H])].
% Expansion of the first generator
[ [$3|T] || T <- joe:perms("3" -- "3")].
% Reduction of --
[ [$3|T] || T <- joe:perms("")].
% First clause of perms/1
[ [$3|T] || T <- [[]]].
% Expansion of the generator
[ [$3|[]] ].
% List reduction (and some syntax sugar)
["3"].

And you can work with the other small lists in an analogous manner. So, going back to the original example…

% Multiple expansion of all simple perm/1's
[ [$1|T]
|| T <- [ [$2|T1] || T1 <- ["3"]] ++
[ [$3|T1] || T1 <- ["2"]] ] ++
[ [$2|T]
|| T <- [ [$1|T1] || T1 <- ["3"]] ++
[ [$3|T1] || T1 <- ["1"]] ] ++
[ [$3|T]
|| T <- [ [$1|T1] || T1 <- ["2"]] ++
[ [$2|T1] || T1 <- ["1"]] ].
% Expansion of all simple generators
[ [$1|T] || T <- [[$2|"3"]] ++ [[$3|"2"]] ] ++
[ [$2|T] || T <- [[$1|"3"]] ++ [[$3|"1"]] ] ++
[ [$3|T] || T <- [[$1|"2"]] ++ [[$2|"1"]] ].
% Reduction of cons operator
[ [$1|T] || T <- ["23"] ++ ["32"] ] ++
[ [$2|T] || T <- ["13"] ++ ["31"] ] ++
[ [$3|T] || T <- ["12"] ++ ["21"] ].
% Reduction of ++
[ [$1|T] || T <- ["23", "32"] ] ++
[ [$2|T] || T <- ["13", "31"] ] ++
[ [$3|T] || T <- ["12", "21"] ].
% Expansion of generators
[[$1|"23"]] ++ [[$1|"32"]] ++
[[$2|"13"]] ++ [[$2|"31"]] ++
[[$3|"12"]] ++ [[$3|"21"]].
% Reduction of ++ and cons
["123", "132", "213", "231", "312", "321"].

Et voilà! 🎉

What you see here is kind of the same algorithm I described above:

For each element in the original list, find all the permutations for the sublist that results from removing it, then attach the element to the head of each permutation.

It’s quite a simple algorithm to describe (and thus the resulting function is also quite simple and descriptive) but it might be hard to understand how such an algorithm actually solves the problem. I hope this article brings some light on the subject. 😃

In Other News…


We’re less than a month away from SpawnFest!
This year it will happen on September 21st & 22nd.
Register your team and join us for FREE to win several amazing prizes provided by our great sponsors!


ElixirConf is coming to Latin America for the first time!
Thanks to our friends from PlayUS Media, we’ll meet in Medellín, Colombia for ElixirConfLA on October 24th & 25th.
It will be an awesome event, you don’t want to miss it :)

Erlang Battleground

We’re still looking for writers. If you want to write on Erlang Battleground, just let us know! ✍️

How to Comprehend Comprehensions was originally published in Erlang Battleground on Medium, where people are continuing the conversation by highlighting and responding to this story.


Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.