Erlang/OTP 27.0 Release Candidate 2

OTP 27.0-rc2

Erlang/OTP 27.0-rc2 is the second release candidate of three before the OTP 27.0 release.

The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you. We encourage users to try it out and give us feedback either by creating an issue at https://github.com/erlang/otp/issues or by posting to Erlang Forums.

All artifacts for the release can be downloaded from the Erlang/OTP GitHub release and you can view the new documentation at https://erlang.org/documentation/doc-15.0-rc2/doc. You can also install the latest release using kerl like this:

kerl build 27.0-rc2 27.0-rc2

Erlang/OTP 27 is a new major release with new features, improvements as well as a few incompatibilities. Some of the new features are highlighted below.

Many thanks to all contributors!

Highlights for RC2

  • There is a new module json for encoding and decoding JSON.

    Both encoding and decoding can be customized. Decoding can be done in a SAX-like fashion and handle multiple documents and streams of data.

    The new json module is used by the ASN.1 jer (JSON Encoding Rules) backend for encoding and decoding JSON, so there is no longer any need to supply an external JSON library.
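
    A minimal sketch of what this might look like in an Erlang shell (the exact output formatting and map key order are illustrative, not guaranteed):

    1> Json = iolist_to_binary(json:encode(#{<<"id">> => 1, <<"name">> => <<"Ada">>})).
    <<"{\"id\":1,\"name\":\"Ada\"}">>
    2> json:decode(Json).
    #{<<"id">> => 1,<<"name">> => <<"Ada">>}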

Other notable changes in RC2

  • The existing experimental support for archive files will be changed in a future release. The support for having an archive in an escript will remain, but the support for using archives in a release will either become more limited or completely removed.

    As of Erlang/OTP 27, the function code:lib_dir/2, the -code_path_choice flag, and using erl_prim_loader for reading members of an archive are deprecated.

    To remain compatible with future versions of Erlang/OTP, escripts that need to retrieve data files from their archive should use escript:extract/2 instead of erl_prim_loader and code:lib_dir/2 (see the sketch after this list).

  • The order in which the compiler looks up options has changed.

    When there is a conflict between the compiler options given in the -compile() attribute and the options given to the compiler, the options given in the -compile() attribute override the options given to the compiler, which in turn override options given in the ERL_COMPILER_OPTIONS environment variable.

    Example:

    If some_module.erl has the following attribute:

    -compile([nowarn_missing_spec]).
    

    and the compiler is invoked like so:

    % erlc +warn_missing_spec some_module.erl
    

    no warnings will be issued for functions that do not have any specs.

  • configure now automatically enables support for year-2038-safe timestamps.

    By default, the configure scripts used when building OTP will now try to enable support for timestamps that will work after mid-January 2038. This has typically only been an issue on 32-bit platforms.

    If configure cannot figure out how to enable such timestamps, it will abort with an error message. If you want to build the system anyway, knowing that the system will not function properly after mid-January 2038, you can pass the --disable-year2038 option to configure, which will enable configure to continue without support for timestamps after mid-January 2038.
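
Regarding the archive deprecation above, a hedged sketch of how an escript could read a bundled data file via escript:extract/2 instead of erl_prim_loader and code:lib_dir/2 (the member path "myapp/priv/data.txt" is made up):

    read_bundled_file() ->
        %% escript:script_name/0 only works from within a running escript.
        {ok, Sections} = escript:extract(escript:script_name(), []),
        {archive, Zip} = lists:keyfind(archive, 1, Sections),
        {ok, [{_Name, Data}]} =
            zip:extract(Zip, [memory, {file_list, ["myapp/priv/data.txt"]}]),
        Data.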

Highlights for RC1

Documentation

EEP-59 has been implemented. Documentation attributes in source files can now be used to document functions, types, callbacks, and modules.

The entire Erlang/OTP documentation is now using the new documentation system.

New language features

  • Triple-quoted strings have been implemented as per EEP 64 to allow a string to encompass a complete paragraph.

  • Adjacent string literals without intervening white space are now a syntax error, to avoid possible confusion with triple-quoted strings.

  • Sigils on string literals (both ordinary and triple-quoted) have been implemented as per EEP 66. For example, ~"Björn" or ~b"Björn" are now equivalent to <<"Björn"/utf8>>.
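
A small illustration of both features (made-up function names; the exact whitespace-stripping rules are defined by EEP 64):

    doc() ->
        """
        Triple-quoted strings can span several lines
        and may contain "quotes" without escaping;
        the indentation of the closing delimiter is stripped.
        """.

    name() ->
        ~"Björn".    % the default sigil; equivalent to <<"Björn"/utf8>>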

Compiler and JIT improvements

  • The compiler will now merge consecutive updates of the same record.

  • Safe destructive update of tuples has been implemented in the compiler and runtime system. This allows the VM to update tuples in-place when it is safe to do so, thus improving performance by doing less copying but also by producing less garbage.

  • The maybe expression is now enabled by default, eliminating the need for enabling the maybe_expr feature (see the sketch after this list).

  • Native coverage support has been implemented in the JIT. It will automatically be used by the cover tool to reduce the execution overhead when running cover-compiled code. There are also new APIs to support native coverage without using the cover tool.

  • The compiler will now raise a warning when updating record/map literals to catch a common mistake. For example, the compiler will now emit a warning for #r{a=1}#r{b=2}.
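
The maybe expression itself looks like this (an illustrative sketch with made-up names; ?= short-circuits to the else clauses on the first failed match):

    -module(maybe_demo).
    -export([pair/1]).

    pair(Map) ->
        maybe
            {ok, A} ?= maps:find(a, Map),
            {ok, B} ?= maps:find(b, Map),
            {ok, {A, B}}
        else
            error -> {error, missing_key}
        end.

Here pair(#{a => 1, b => 2}) returns {ok, {1, 2}}, while pair(#{a => 1}) falls through to the else clause and returns {error, missing_key}.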

ERTS

  • The erl command now supports the -S flag, which is similar to the -run flag, but with some of the rough edges filed off.

  • By default, escripts will now be compiled instead of interpreted. That means that the compiler application must be installed.

  • The default process limit has been raised to 1048576 processes.

  • The erlang:system_monitor/2 functionality is now able to monitor long message queues in the system.

  • The obsolete and undocumented support for opening a port to an external resource by passing an atom (or a string) as first argument to open_port(), implemented by the vanilla driver, has been removed. This feature has been scheduled for removal in OTP 27 since the release of OTP 26.

  • The pid field has been removed from erlang:fun_info/1,2.

  • Multiple trace sessions are now supported.

STDLIB

  • Several new functions that accept funs have been added to module timer.

  • The functions is_equal/2, map/2, and filtermap/2 have been added to the modules sets, ordsets, and gb_sets.

  • There are new efficient ets traversal functions with guaranteed atomicity. For example, ets:next/2 followed by ets:lookup/2 can now be replaced with ets:next_lookup/2 (see the sketch after this list).

  • The new function ets:update_element/4 is similar to ets:update_element/3, but takes a default tuple as the fourth argument, which will be inserted if no previous record with that key exists.

  • binary:replace/3,4 now supports using a fun for supplying the replacement binary.

  • The new function proc_lib:set_label/1 can be used to add a descriptive term to any process that does not have a registered name. The name will be shown by tools such as c:i/0 and observer, and it will be included in crash reports produced by processes using gen_server, gen_statem, gen_event, and gen_fsm.

  • Added functions to retrieve the next higher or lower key/element from gb_trees and gb_sets, as well as returning iterators that start at given keys/elements.
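
A hedged sketch of the two ets additions above (the table contents and the update are made up):

    demo() ->
        T = ets:new(demo, [ordered_set]),
        true = ets:insert(T, {alice, 1}),
        %% Update position 2 for key bob; {bob, 0} is the object inserted
        %% first if no object with that key exists yet.
        ets:update_element(T, bob, {2, 7}, {bob, 0}),
        %% Step to the key after alice and fetch its objects in one atomic call.
        {bob, BobObjects} = ets:next_lookup(T, alice),
        BobObjects.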

common_test

  • Calls to ct:capture_start/0 and ct:capture_stop/0 are now synchronous to ensure that all output is captured.

  • The default CSS will now include a basic dark mode handling if it is preferred by the browser.

crypto

  • The functions crypto_dyn_iv_init/3 and crypto_dyn_iv_update/3 that were marked as deprecated in Erlang/OTP 25 have been removed.

dialyzer

  • The --gui option for Dialyzer has been removed.

ssl

  • The ssl client can negotiate and handle certificate status request (OCSP stapling support on the client side).

tools

  • There is a new tool tprof, which combines the functionality of eprof and cprof under one interface. It also adds heap profiling.

xmerl

  • As an alternative to xmerl_xml, a new export module xmerl_xml_indent that provides out-of-the-box indented output has been added.

For more details about new features and potential incompatibilities see the README.


A Commentary on Defining Observability

2024/03/19


Recently, Hazel Weakly published a great article titled Redefining Observability. In it, she covers competing classical definitions of observability, the weaknesses they have, and offers a practical reframing of the concept in the context of software organizations (well, not only software organizations, but the examples tilt that way).

I agree with her post in most ways, and so this blog post of mine is more of an improv-like “yes, and…” response to it, in which I also try to frame the many existing models as complementary or contrasting perspectives of a general concept.

The main points I’ll try to bring here are on the topics of the difference between insights and questions, the difference between observability and data availability, reinforcing a socio-technical definition, the mess of complex systems and mapping them, and finally, a hot take on the use of models when reasoning about systems.

Insights and Questions

The control theory definition of observability, from Rudolf E. Kálmán, goes as follows:

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

The one from Woods and Hollnagel in cognitive engineering goes like this:

Observability is feedback that provides insight into a process and refers to the work needed to extract meaning from available data.

Hazel version’s, by comparison, is:

Observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.

While all three definitions relate to extracting information from a system, Hazel’s definition sits at a higher level by specifically mentioning questions and actions. It’s a more complete feedback loop including some people, their mental models, and seeking to enrich them or use them.

I think that higher level ends up erasing a property of observability in the other definitions: it doesn’t have to be inquisitive nor analytical.

Let’s take a washing machine, for example. You can know whether it is being filled or spinning by sound. The lack of sound itself can be a signal about whether it is running or not. If it is overloaded, it might shake a lot and sound out of balance during the spin cycle. You don’t necessarily have to be in the same room as the washing machine nor paying attention to it to know things about its state, passively create an understanding of normalcy, and learn about some anomalies in there.

Another example here would be something as simple as a book. If you’re reading a good old paper book, you know you’re nearing the end of it just by how the pages you have read make a thicker portion of the book than those you haven’t read yet. You do not have to think about it, the information is inherent to the medium. An ebook read on an electronic device, however, will hide that information unless a design decision is made to show how many lines or words have been read, display a percentage, or a time estimate of the content left.

Observability for the ebook isn’t innate to its structure and must be built in deliberately. Similarly, you could know old PCs were doing heavy work if the disk was making noise and when the fan was spinning up; it is not possible to know as much on a phone or laptop that has an SSD and no fan unless someone builds a way to expose this data.

Associations and patterns can be formed by the observer in a way that provides information and explanations, leading to effective action and management of the system in play. It isn’t something always designed or done on purpose, but it may need to be.

The key element is that an insight can be obtained without asking questions. In fact, a lot of anomaly detection is done passively, by the observer having a sort of mental construct of what normal is that lets them figure out what should happen next—what the future trajectory of the system is—and to then start asking questions when these expectations are not met. The insights, therefore, can come before the question is asked. Observability can be described as a mechanism behind this.

I don’t think that this means Hazel’s definition is wrong; I think it might just be a consequence of the scale at which her definition operates. However, this distinction is useful for a segue into the difference between data availability and observability.

The difference between observability and data availability

For a visual example, I’ll use variations on a painting from 1866 named In a Roman Osteria, by Danish painter Carl Bloch:

[Image: two versions of the painting. On the left, a rough black-and-white hand-drawn outline showing three characters in the back and a cat on a bench next to the three main characters eating a meal. On the right, the original image cut up into the shuffled pieces of a jigsaw puzzle.]

The first one is an outline, and the second is a jigsaw puzzle version (with all the pieces right side up and in the correct orientation, at least).

The jigsaw puzzle has 100% data availability. All of the information is there and you can fully reconstruct the initial painting. The outlined version has a lot less data available, but if you’ve never seen the painting before, you will get a better understanding from it in absolutely no time compared to the jigsaw.

This “make the jigsaw show you what you need faster” approach is more or less where a lot of observability vendors operate: the data is out there, you need help to store it and put it together and extract the relevancy out of it:

[Image: the original In a Roman Osteria painting.]

What this example highlights though is that while you may get better answers with richer and more accurate data (given enough time, tools, and skill), the outline is simpler and may provide adequate information with less effort required from the observer. Effective selection of data, presented better, may be more appropriate during high-tempo, tense situations where quick decisions can make a difference.

This, at least, implies that observability is not purely a data problem nor a tool problem (which lines up, again, with what Hazel said in her post). However, it hints at it being a potential design problem. The way data is presented, the way affordances are added, and whether the system is structured in a way that makes storytelling possible all can have a deep impact in how observable it turns out to be.

Sometimes, coworkers mention that some of our services are really hard to interpret even when using Honeycomb (which we build and operate!). My theory about that is that too often, the data we output for observability is structured with the assumption that the person looking at it will have the code on hand—as the author did when writing it—and will be able to map telemetry data to specific areas of code.

So when you’re in there writing queries and you don’t know much about the service, the traces mean little. As a coping mechanism, social patterns emerge where data that is generally useful is kept on some specific spans that are considered important, but that you can only find if someone more familiar with this area explained where it was to you already. It draws into pre-existing knowledge of the architecture, of communication patterns, of goals of the application that do not live within the instrumentation.

Traces that are easier to understand and explore make use of patterns that are meaningful to the investigator, regardless of their understanding of the code. For the more “understandable” telemetry data, the naming, structure, and level of detail are more related to how the information is to be presented than the structure of the underlying implementation.

Observability requires interpretation, and interpretation sits in the observer. What is useful or not will be really hard to predict, and people may find patterns and affordances in places that weren’t expected or designed, but still significant. Catering to this property requires taking a perspective of the system that is socio-technical.

The System is Socio-Technical

Once again for this section, I agree with Hazel on the importance of people in the process. She has lots of examples of good questions that exemplify this. I just want to push even harder here.

Most of the examples I’ve given so far were technical: machines and objects whose interpretation is done by humans. Real complex systems don’t limit themselves to technical components being looked at; people are involved, talking to each other, making decisions, and steering the overall system around.

This is nicely represented by the STELLA Report’s Above/Below the line diagram:

[Figure 4 from the STELLA Report, showing the above-the-line/below-the-line diagram.]

The continued success of the overall system does not purely depend on the robustness of technical components and their ability to withstand challenges for which they were designed. When challenges beyond what was planned for do happen, and when the context in which the system operates changes (whether it is due to competition, legal frameworks, pandemics, or evolving user needs and tastes), adjustments need to be made to adapt the system and keep it working.

The adaptation is not done purely on a technical level, by fixing and changing the software and hardware, but also by reconfiguring the organization, by people learning new things, by getting new or different people in the room, by reframing the situation, and by steering things in a new direction. There is a constant gap to bridge between a solution and its context, and the ability to anticipate these challenges, prepare for them, and react to them can be informed by observability.

Observability at the technical level (“instrument the code and look at it with tools”) is covered by all definitions of observability in play here, but I want to really point out that observability can go further.

If you reframe your system as properly socio-technical, then yes, you will need technical observability interpreted at the social level. But you may also need social observability handled at the social level: are employees burning out? Do we have the psychological safety required to learn from events? Do I have silos of knowledge that render my organization brittle? What are people working on? Where is the market at right now? Are our users leaving us for the competition? Are our employees leaving us for competitors? How do we deal with a fast-moving space with limited resources?

There are so many ways for an organization to fail that aren’t technical, and ideally we’d also keep an eye on them. A definition of observability that is technical in nature can set artificial boundaries to your efforts to gain insights from ongoing processes. I believe Hazel’s definition maps to this ideal more clearly than the cognitive engineering one, but I want to re-state the need to avoid strictly framing its application to technical components observed by people.

A specific dynamic I haven't mentioned here—and this is something disciplines like cybernetics, cognitive engineering, and resilience engineering all have an interest in—is one where the technical elements of the system know about the social elements of the system. We essentially do not currently have automation (nor AI) sophisticated enough to be good team members.

For example, while I can detect a coworker is busy managing a major outage or meeting with important customers in the midst of a contract renewal, pretty much no alerting system will be able to find that information and decide to ask for assistance from someone else who isn’t as busy working on high-priority stuff within the system. The ability of one agent to shape their own work based on broader objectives than their private ones is something that requires being able to observe other agents in the system, map that to higher goals, and shape their own behaviour accordingly.

Ultimately, a lot of decisions are made through attempts at steering the system or part of it in a given direction. This needs some sort of [mental] map of the relationships in the system, and an even harder thing to map out is the impact that having this information will have on the system itself.

Complex Systems Mapping Themselves

Recently I was at work trying to map concepts about reliability, and came up with this mess of a diagram showing just a tiny portion of what I think goes into influencing system reliability (the details are unimportant):

[Image: a very messy set of bubbles and arrows, provided with no explanations.]

In this concept map, very few things are purely technical; lots of work is social, process-driven, and is about providing feedback loops. As the system grows more complex, analysis and control lose some power, and sense-making and influence become more applicable.

The overall system becomes unknowable, and nearly impossible to map out—by the time the map is complete, it’s already outdated. On top of that, the moment the above map becomes used to make decisions, its own influence might need to become part of itself, since it has entered the feedback loop of how decisions are made. These things can’t be planned out, and sometimes can only be discovered in a timely manner by acting.

Basically, the point here is that not everything is observable via data availability and search. Some questions you have can only be answered by changing the system, either through adding new data, or by extracting the data through probing of the system. Try a bunch of things and look at the consequences.

A few years ago, I was talking with David Woods (to be exact, he was telling me useful things and I was listening) and he compared complex socio-technical systems to a messy web; everything is somehow connected to everything, and it’s nearly impossible to just keep track of all the relevant connections in your head. Things change and some elements will be more relevant today than they were yesterday. As we walk the web, we rediscover connections that are important, some that stopped being so significant, and so on.

Experimental practices like chaos engineering or fault injection aren’t just about testing behaviour for success and failure, they are also about deliberately exploring the connections and parts of the web we don’t venture into as often as we’d need to in order to maintain a solid understanding of it.

One thing to keep in mind is that the choice of which experiment to run is also based on the existing map and understanding of situations and failures that might happen. There is a risk in the planners and decision-makers not considering themselves to be part of the system they are studying, and of ignoring their own impact and influence.

This leads to elements such as pressures, goal conflicts, and adaptations to them, which may tend to only become visible during incidents. The framing of what to investigate, how to investigate it, how errors are constructed, and which questions are or aren't worth asking all participate in the weird complex feedback loop within the big messy systems we're in. The tools required for that level of analysis are however very, very different from what most observability vendors provide, and are generally never marketed as such, which does tie back to Hazel's conclusion that "observability is organizational learning."

A Hot Take on the Use of Models

Our complex socio-technical systems aren’t closed systems. Competitors exist; employees bring in their personal life into their work life; the environment and climate in which we live plays a role. It’s all intractable, but simplified models help make bits of it all manageable, at least partially. A key criterion is knowing when a model is worth using and when it is insufficient.

Hazel Weakly again hints at this when she states:

  • The control theory version gives you a way to know whether a system is observable or not, but it ignores the people and gives you no way to get there
  • The cognitive engineering version is better, but doesn’t give you a “why” you should care, nor any idea of where you are and where to go
  • Her version provides a motivation and a sense of direction as a process

I don’t think these versions are in conflict. They are models, and models have limits and contextual uses. In general I’d like to reframe these models as:

  • What data may be critical to provide from a technical component’s point of view (control theory model)
  • How people may process the data and find significance in it (cognitive engineering model)
  • How organizations should harness the mechanism as a feedback loop to learn and improve (Hazel Weakly’s model)

They work on different concerns by picking a different area of focus, and therefore highlight different parts of the overall system. It’s a bit like how looking at phenomena at a human scale with your own eyes, at a micro scale with a microscope, and at an astronomical scale with a telescope, all provide you with useful information despite operating at different levels. While the astronomical scale tools may not bear tons of relevance at the microscopic scale operations, they can all be part of the same overall search of understanding.

Much like observability can be improved despite having less data if it is structured properly, a few simpler models can let you make better decisions in the proper context.

My hope here was not to invalidate anything Hazel posted, but to keep the validity and specificity of the other models through additional contextualization.


Scaling a streaming service to hundreds of thousands of concurrent viewers at Veeps

Welcome to our series of case studies about companies using Elixir in production.

Veeps is a streaming service that offers direct access to live and on-demand events by award-winning artists at the most iconic venues. Founded in 2018, it became part of Live Nation Entertainment in 2021.

Veeps have been named one of the ten most innovative companies in music and nominated for an Emmy. They currently hold the Guinness World Record for the world’s largest ticketed livestream by a solo male artist—a performance where Elixir and Phoenix played an important role in the backend during the streaming.

This case study examines how Elixir drove Veeps’ technical transformation, surpassing high-scale demands while keeping the development team engaged and productive.

The challenge: scaling to hundreds of thousands of simultaneous users

Imagine you are tasked with building a system that can livestream a music concert to hundreds of thousands of viewers around the world at the same time.

In some cases, users must purchase a ticket before the concert can be accessed. For a famous artist, it’s not uncommon to see thousands of fans continuously refreshing their browsers and attempting to buy tickets within the first few minutes of the announcement.

The Veeps engineering team needed to handle both challenges.

Early on, the Veeps backend was implemented in Ruby on Rails. Its first version could handle a few thousand simultaneous users watching a concert without any impact on stream quality, which was fine with a handful of shows but would be insufficient with the expected show load and the massive increase in concurrent viewership across streams. It was around that time that Vincent Franco joined Veeps as their CTO.

Vincent had an extensive background in building and maintaining ticketing and event management software at scale. So, he used that experience to further improve the system to handle tens of thousands of concurrent users. However, it became clear that scaling it to hundreds of thousands would be a difficult challenge, requiring substantial engineering effort and increased operational costs. The team began evaluating other stacks that could provide out-of-the-box tooling for scaling in order to reach both short- and long-term goals.

Adopting Elixir, hiring, and rewriting the system

Vincent, who had successfully deployed Elixir as part of high-volume systems in the past, believed Elixir was an excellent fit for Veeps’ requirements.

Backed by his experience and several case studies from the Elixir community, such as the one from Discord, Vincent convinced management that Elixir could address their immediate scaling needs and become a reliable foundation on which the company could build.

With buy-in from management, the plan was set in motion. They had two outstanding goals:

  • Prepare the platform to welcome the most famous artists in the world.
  • Build their own team of engineers to help innovate and evolve the product.

Vincent knew that hiring right-fit technical people can take time and he didn’t want to rush the process. Hence, he hired DockYard to rebuild the system while simultaneously searching for the right candidates to build out the team.

Eight months later, the system had been entirely rewritten in Elixir and Phoenix. Phoenix Channels were used to enrich the live concert experience, while Phoenix LiveView empowered the ticket shopping journey.

The rewrite was put to the test shortly after with a livestream that remains one of Veeps’ biggest, still to this day. Before the rewrite, 20 Rails nodes were used during big events, whereas now, the same service requires only 2 Elixir nodes. And the new platform was able to handle 83x more concurrent users than the previous system.

The increase in infrastructure efficiency significantly reduced the need for complex auto-scaling solutions while providing ample capacity to handle high-traffic spikes.

The rewrite marked the most extensive and intricate system migration in my career, and yet, it was also the smoothest.

- Vincent Franco, CTO

This was a big testament to Elixir and Phoenix’s scalability and gave the team confidence that they made the right choice.

By the time the migration was completed, Veeps had also assembled an incredibly talented team of two backend and two frontend engineers, which continued to expand and grow the product.

Perceived benefits of using Elixir and its ecosystem

After using Elixir for more than two years, Veeps has experienced significant benefits. Here are a few of them.

Architectural simplicity

Different parts of the Veeps system have different scalability requirements. For instance, when streaming a show, the backend receives metadata from users’ devices every 30 seconds to track viewership. This is the so-called Beaconing service.

Say you have 250,000 people watching a concert: the Beaconing service needs to handle thousands of requests per second for a few hours at a time. As a result, it needs to scale differently from other parts of the system, such as the merchandise e-commerce or backstage management.

To tackle this issue, they built a distributed system. They packaged each subsystem as an Elixir release, totaling five releases. For the communication layer, they used distributed Erlang, which is built into Erlang/OTP, allowing seamless inter-process communication across networked nodes.

In a nutshell, each node contains several processes with specific responsibilities. Each of these processes belongs to its respective distributed process group. If node A needs billing information, it will reach out to any process within the "billing process group", which may be anywhere in the cluster.
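
The post does not say which primitive backs these groups; one built-in option is OTP's pg module, so the following is only a rough sketch under that assumption, with made-up group and message names:

    %% Assumes the default pg scope is running (for example via pg:start_link/0
    %% in a supervisor). A billing worker joins the group at startup:
    join_billing_group(WorkerPid) ->
        ok = pg:join(billing, WorkerPid).

    %% Any process on any node can then pick a member of the group and call it.
    get_invoice(OrderId) ->
        [Pid | _] = pg:get_members(billing),
        Ref = make_ref(),
        Pid ! {get_invoice, self(), Ref, OrderId},
        receive
            {Ref, Invoice} -> Invoice
        after 5000 -> {error, timeout}
        end.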

When deploying a new version of the system, they deploy a new cluster altogether, with all five subsystems at once. Given Elixir’s scalability, the whole system uses 9 nodes, making a simple deployment strategy affordable and practical. As we will see, this approach is well-supported during development too, thanks to the use of Umbrella Projects.

Service-oriented architecture within a monorepo

Although they run a distributed system, they organize the code in only one repository, following the monorepo approach. To do that, they use the Umbrella Project feature from Mix, the build tool that ships with Elixir.

Their umbrella project consists of 16 applications (at the time of writing), which they sliced into five OTP releases. The remaining applications contain code that needs to be shared between multiple applications. For example, one of the shared applications defines all the structs sent as messages across the subsystems, guaranteeing that all subsystems use the same schemas for that exchanged data.

With umbrella projects, you can have the developer experience benefits of a single code repository, while being able to build a service-oriented architecture.

- Andrea Leopardi, Principal Engineer

Reducing complexity with the Erlang/Elixir toolbox

Veeps has an e-commerce platform that allows concert viewers to purchase artist merchandise. In e-commerce, a common concept is a shopping cart. Veeps models each shopping cart as a GenServer, which is a lightweight process managed by the Erlang VM.

This decision made it easier for them to implement other business requirements, such as locking the cart during payments and shopping cart expiration. Since each cart is a process, the expiration is as simple as sending a message to a cart process based on a timer, which is easy to do using GenServers.
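
The Veeps code base is Elixir and not public, so the following is only an illustrative sketch of the idea using Erlang's gen_server, with a made-up expiration timeout:

    -module(cart).
    -behaviour(gen_server).

    -export([start_link/1, add_item/2, lock/1]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

    -define(EXPIRE_AFTER, 30 * 60 * 1000).  % 30 minutes of inactivity (illustrative)

    start_link(CartId) -> gen_server:start_link(?MODULE, CartId, []).

    add_item(Cart, Item) -> gen_server:call(Cart, {add_item, Item}).
    lock(Cart)           -> gen_server:call(Cart, lock).

    init(CartId) ->
        %% One process per cart; the gen_server timeout doubles as the expiration timer.
        {ok, #{id => CartId, items => [], locked => false}, ?EXPIRE_AFTER}.

    handle_call({add_item, _Item}, _From, #{locked := true} = Cart) ->
        {reply, {error, locked}, Cart, ?EXPIRE_AFTER};
    handle_call({add_item, Item}, _From, #{items := Items} = Cart) ->
        {reply, ok, Cart#{items := [Item | Items]}, ?EXPIRE_AFTER};
    handle_call(lock, _From, Cart) ->
        %% Locking the cart during payment is just another synchronous call.
        {reply, ok, Cart#{locked := true}, ?EXPIRE_AFTER}.

    handle_cast(_Msg, Cart) ->
        {noreply, Cart, ?EXPIRE_AFTER}.

    handle_info(timeout, Cart) ->
        %% No activity within ?EXPIRE_AFTER milliseconds: the cart expires.
        {stop, normal, Cart}.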

For caching, the team relies on ETS (Erlang Term Storage), a high-performance key-value store that is part of the Erlang standard library. For cache busting between multiple parts of the distributed system, they use Phoenix PubSub, a real-time publisher/subscriber library that comes with Phoenix.

Before the rewrite, the Beaconing service used Google’s Firebase. Now, the system uses Broadway to ingest data from hundreds of thousands of HTTP requests from concurrent users. Broadway is an Elixir library for building concurrent data ingestion and processing pipelines. They utilized the library’s capabilities to efficiently send requests to AWS services, regulating batch sizes to comply with AWS limits. They also used it to handle rate limiting to adhere to AWS service constraints. All of this was achieved with Broadway’s built-in functionality.

Finally, they use Oban, an Elixir library for background jobs, for all sorts of background-work use cases.

Throughout the development journey, Veeps consistently found that Elixir and its ecosystem had built-in solutions for their technical challenges. Here’s what Vincent, CTO of Veeps, had to say about that:

Throughout my career, I’ve worked with large-scale systems at several companies. However, at Veeps, it’s unique because we achieve this scale with minimal reliance on external tools. It’s primarily just Elixir and its ecosystem that empower us.

- Vincent Franco, CTO

This operational simplicity benefitted not only the production environment but also the development side. The team could focus on learning Elixir and its ecosystem without the need to master additional technologies, resulting in increased productivity.

LiveView: simplifying the interaction between front-end and back-end developers

After the rewrite, LiveView, a Phoenix library for building interactive, real-time web apps, was used for every part of the front-end except for the “Onstage” subsystem (responsible for the live stream itself).

The two front-end developers, who came from a React background, also started writing LiveView. After this new experience, the team found the process of API negotiation between the front-end and back-end engineers much simpler compared to when using React. This was because they only had to use Elixir modules and functions instead of creating additional HTTP API endpoints and all the extra work that comes with them, such as API versioning.

Our front-end team, originally proficient in React, has made a remarkable transition to LiveView. They’ve wholeheartedly embraced its user-friendly nature and its smooth integration into our system.

- Vincent Franco, CTO

Conclusion: insights from Veeps’ Elixir experience

The decision to use Elixir has paid dividends beyond just system scalability. The team, with varied backgrounds in Java, PHP, Ruby, Python, and JavaScript, found Elixir's ecosystem to be a harmonious balance of simplicity and power.

By embracing Elixir’s ecosystem, including Erlang/OTP, Phoenix, LiveView, and Broadway, they built a robust system, eliminated the need for numerous external dependencies, and kept productively developing new features.

Throughout my career, I’ve never encountered a developer experience as exceptional as this. Whether it’s about quantity or complexity, tasks seem to flow effortlessly. The team’s morale is soaring, everyone is satisfied, and there’s an unmistakable atmosphere of positivity. We’re all unequivocally enthusiastic about this language.

- Vincent Franco, CTO

Veeps’ case illustrates how Elixir effectively handles high-scale challenges while keeping the development process straightforward and developer-friendly.


Erlang/OTP 27.0 Release Candidate 1

OTP 27.0-rc1

Erlang/OTP 27.0-rc1 is the first release candidate of three before the OTP 27.0 release.

The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you. We encourage users to try it out and give us feedback either by creating an issue at https://github.com/erlang/otp/issues or by posting to Erlang Forums.

All artifacts for the release can be downloaded from the Erlang/OTP GitHub release and you can view the new documentation at https://erlang.org/documentation/doc-15.0-rc1/doc. You can also install the latest release using kerl like this:

kerl build 27.0-rc1 27.0-rc1

Erlang/OTP 27 is a new major release with new features, improvements as well as a few incompatibilities. Some of the new features are highlighted below.

Many thanks to all contributors!

Highlights

Documentation

EEP-59 has been implemented. Documentation attributes in source files can now be used to document functions, types, callbacks, and modules.

The entire Erlang/OTP documentation is now using the new documentation system.

New language features

  • Triple-quoted strings have been implemented as per EEP 64 to allow a string to encompass a complete paragraph.

  • Adjacent string literals without intervening white space are now a syntax error, to avoid possible confusion with triple-quoted strings.

  • Sigils on string literals (both ordinary and triple-quoted) have been implemented as per EEP 66. For example, ~"Björn" or ~b"Björn" are now equivalent to <<"Björn"/utf8>>.

Compiler and JIT improvements

  • The compiler will now merge consecutive updates of the same record.

  • Safe destructive update of tuples has been implemented in the compiler and runtime system. This allows the VM to update tuples in-place when it is safe to do so, thus improving performance by doing less copying but also by producing less garbage.

  • The maybe expression is now enabled by default, eliminating the need for enabling the maybe_expr feature.

  • Native coverage support has been implemented in the JIT. It will automatically be used by the cover tool to reduce the execution overhead when running cover-compiled code. There are also new APIs to support native coverage without using the cover tool.

  • The compiler will now raise a warning when updating record/map literals to catch a common mistake. For example, the compiler will now emit a warning for #r{a=1}#r{b=2}.

ERTS

  • The erl command now supports the -S flag, which is similar to the -run flag, but with some of the rough edges filed off.

  • By default, escripts will now be compiled instead of interpreted. That means that the compiler application must be installed.

  • The default process limit has been raised to 1048576 processes.

  • The erlang:system_monitor/2 functionality is now able to monitor long message queues in the system.

  • The obsolete and undocumented support for opening a port to an external resource by passing an atom (or a string) as first argument to open_port(), implemented by the vanilla driver, has been removed. This feature has been scheduled for removal in OTP 27 since the release of OTP 26.

  • The pid field has been removed from erlang:fun_info/1,2.

  • Multiple trace sessions are now supported.

STDLIB

  • Several new functions that accept funs have been added to module timer.

  • The functions is_equal/2, map/2, and filtermap/2 have been added to the modules sets, ordsets, and gb_sets.

  • There are new efficient ets traversal functions with guaranteed atomicity. For example, ets:next/2 followed by ets:lookup/2 can now be replaced with ets:next_lookup/2.

  • The new function ets:update_element/4 is similar to ets:update_element/3, but takes a default tuple as the fourth argument, which will be inserted if no previous record with that key exists.

  • binary:replace/3,4 now supports using a fun for supplying the replacement binary.

  • The new function proc_lib:set_label/1 can be used to add a descriptive term to any process that does not have a registered name. The name will be shown by tools such as c:i/0 and observer, and it will be included in crash reports produced by processes using gen_server, gen_statem, gen_event, and gen_fsm.

  • Added functions to retrieve the next higher or lower key/element from gb_trees and gb_sets, as well as returning iterators that start at given keys/elements.

common_test

  • Calls to ct:capture_start/0 and ct:capture_stop/0 are now synchronous to ensure that all output is captured.

  • The default CSS will now include a basic dark mode handling if it is preferred by the browser.

crypto

  • The functions crypto_dyn_iv_init/3 and crypto_dyn_iv_update/3 that were marked as deprecated in Erlang/OTP 25 have been removed.

dialyzer

  • The --gui option for Dialyzer has been removed.

ssl

  • The ssl client can negotiate and handle certificate status request (OCSP stapling support on the client side).

tools

  • There is a new tool tprof, which combines the functionality of eprof and cprof under one interface. It also adds heap profiling.

xmerl

  • As an alternative to xmerl_xml, a new export module xmerl_xml_indent that provides out-of-the-box indented output has been added.

For more details about new features and potential incompatibilities see the README.


A Distributed Systems Reading List

2024/02/07


This document contains various resources and quick definition of a lot of background information behind distributed systems. It is not complete, even though it is kinda sorta detailed. I had written it some time in 2019 when coworkers at the time had asked for a list of references, and I put together what I thought was a decent overview of the basics of distributed systems literature and concepts.

Since I was asked for resources again recently, I decided to pop this text into my blog. I have verified the links again and replaced those that broke with archive links or other ones, but have not sought alternative sources when the old links worked, nor taken the time to add any extra content for new material that may have been published since then.

It is meant to be used as a quick reference to understand various distsys discussions, and to discover the overall space and possibilities that are around this environment.

Foundational theory

This is information providing the basics of all the distsys theory. Most of the papers or resources you read will make references to some of these concepts, so explaining them makes sense.

Models

In a Nutshell

There are three model types used by computer scientists doing distributed system theory:

  1. synchronous models
  2. semi-synchronous models
  3. asynchronous models

A synchronous model means that each message sent within the system has a known upper bound on communications (max delay between a message being sent and received) and the processing speed between nodes or agents. This means that you can know for sure that after a period of time, a message was missed. This model is applicable in rare cases, such as hardware signals, and is mostly beginner mode for distributed system proofs.

An asynchronous model means that you have no upper bound. It is legit for agents and nodes to process and delay things indefinitely. You can never assume that a "lost" message you haven't seen for the last 15 years won't just happen to be delivered tomorrow. The other node can also be stuck in a GC loop that lasts 500 centuries, that's good.

Proving that something works in an asynchronous model means it works with all the other types. This is expert mode for proofs, and in most cases even trickier to make work than real-world implementations.

Semi-synchronous models are the cheat mode for the real world. There are upper bounds on the communication mechanisms and nodes everywhere, but they are often configurable and unspecified. This is what lets a protocol designer go "you know what, we're gonna stick a ping message in there, and if you miss too many of them we consider you're dead."

You can't assume all messages are delivered reliably, but you give yourself a chance to say "now that's enough, I won't wait here forever."

Protocols like Raft, Paxos, and ZAB (quorum protocols behind etcd, Chubby, and ZooKeeper respectively) all fit this category.

Theoretical Failure Modes

The way failures happen and are detected is important to a bunch of algorithms. The following are the most commonly used ones:

  1. Fail-stop failures
  2. Crash failures
  3. Omission failures
  4. Performance failures
  5. Byzantine failures

First, Fail-stop failures mean that if a node encounters a problem, everyone can know about it and detect it, and can restore state from stable storage. This is easy mode for theory and protocols, but super hard to achieve in practice (and in some cases impossible).

Crash failures mean that if a node or agent has a problem, it crashes and then never comes back. You are either correct or late forever. This is actually easier to design around than fail-stop in theory (but a huge pain to operate because redundancy is the name of the game, forever).

Omission failures imply that a node either gives correct results that respect the protocol, or never answers.

Performance failures assume that while you respect the protocol in terms of the content of the messages you send, you may also send results late.

Byzantine failures mean that anything can go wrong (including people willingly trying to break your protocol with bad software pretending to be good software). There's a special class of authentication-detectable byzantine failures which at least adds the constraint that you can't forge messages from other nodes, but that is an optional thing. Byzantine modes are the worst.

By default, most distributed system theory assumes that there are no bad actors or agents that are corrupted and willingly trying to break stuff, and byzantine failures are left up to blockchains and some forms of package management.

Most modern papers and stuff will try and stick with either crash or fail-stop failures since they tend to be practical.

See this typical distsys intro slide deck for more details.

Consensus

This is one of the core problems in distributed systems: how can all the nodes or agents in a system agree on one value? The reason it's so important is that if you can agree on just one value, you can then do a lot of things.

The most common example of picking a single very useful value is the name of an elected leader that enforces decisions, just so you can stop having to build more consensuses because holy crap consensuses are painful.

Variations exist on what exactly counts as a consensus, including whether everyone agrees fully (strong) or just a majority (t-resilient), and asking the same question under various synchronicity or failure models.

Note that while classic protocols like Paxos use a leader to ensure consistency and speed up execution while remaining consistent, a bunch of systems will forgo these requirements.

FLP Result

In A Nutshell

Stands for Fischer-Lynch-Paterson, the authors of a 1985 paper which states that proper consensus, where all participants agree on a value, is unsolvable in a purely asynchronous model (even though it is solvable in a synchronous model) as long as any kind of failure is possible, even if it's just delays.

It's one of the most influential papers in the arena because it triggered a lot of other work for other academics to define what exactly is going on in distributed systems.

Detailed reading

Fault Detection

Following FLP results, which showed that failure detection was kind of super-critical to making things work, a lot of computer science folks started working on what exactly it means to detect failures.

This stuff is hard and often much less impressive than we'd hope for it to be. There are strong and weak fault detectors. The former implies all faulty processes are eventually identified by all non-faulty ones, and the latter that only some non-faulty processes find out about faulty ones.

Then there are degrees of accuracy:

  1. Nobody who has not crashed is suspected of being crashed
  2. It's possible that a non-faulty process is never suspected at all
  3. You can be confused because there's chaos but at some point non-faulty processes stop being suspected of being bad
  4. At some point there's at least one non-faulty process that is not suspected

You can possibly realize that a strong and fully accurate detector (said to be perfect) kind of implies that you get a consensus, and since consensus is not really doable in a fully asynchronous system model with failures, then there are hard limits to things you can detect reliably.

This is often why semi-synchronous system models make sense: if you treat delays greater than T to be a failure, then you can start doing adequate failure detection.
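
As a toy sketch of that idea (names and the reporting mechanism are made up), a watcher that suspects a peer after T milliseconds of silence could look like:

    watch(PeerName, T) ->
        receive
            {heartbeat, PeerName} ->
                watch(PeerName, T)    % heard from the peer; wait for the next beat
        after T ->
            io:format("~p suspected of failure (silent for ~pms)~n", [PeerName, T]),
            watch(PeerName, T)        % the suspicion may be wrong; keep watching
        end.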

See this slide deck for a decent intro.

CAP Theorem

The CAP theorem was for a long while just a conjecture, but it was proven in the early 2000s, leading to a lot of eventually consistent databases.

In A Nutshell

There are three properties to a distributed system:

  • Consistency: any time you write to a system and read back from it, you get the value you wrote or a fresher one back.
  • Availability: every request results in a response (including both reads and writes)
  • Partition tolerance: the network can lose messages

In theory, you can get a system that is both available and consistent, but only under synchronous models on a perfect network. Those don't really exist so in practice P is always there.

What the CAP theorem states is essentially that given P, you have to choose either A (keep accepting writes and potentially corrupt data) or C (stop accepting writes to save the data, and go down).

Refinements

CAP is a bit strict in what you get in practice. Not all partitions in a network are equivalent, and not all consistency levels are the same.

Two of the most common approaches to add some flexibility to the CAP theorem are the Yield/Harvest models and PACELC.

Yield/Harvest essentially says that you can think of the system differently: yield is your ability to fulfill requests (as in uptime), and harvest is the fraction of all the potential data you can actually return. Search engines are a common example here, where they will increase their yield and answer more often by reducing their harvest when they ignore some search results to respond faster if at all.

PACELC adds the idea that eventually-consistent databases are overly strict. In case of network Partitioning you have to choose between Availability and Consistency, but Else (when the system is running normally) one has to choose between Latency and Consistency. The idea is that you can decide to degrade your consistency for availability (but only when you really need to), or you could decide to always forego consistency because you gotta go fast.

It is important to note that you cannot beat the CAP theorem (as long as you respect the models under which it was proven), and anyone claiming to do so is often a snake oil salesman.

Resources

There's been countless rehashes of the CAP theorem and various discussions over the years; the results are mathematically proven even if many keep trying to make the argument that they're so reliable it doesn't matter.

Message Passing Definitions

Messages can be sent zero or more times, in various orderings. Some terms are introduced to define what they are:

  • unicast means that the message is sent to one entity only
  • anycast means that the message is sent to any valid entity
  • broadcast means that a message is sent to all valid entities
  • atomic broadcast or total order broadcast means that all the non-faulty actors in a system receive the same messages in the same order, whichever that order is
  • gossip stands for the family of protocols where messages are forwarded between peers with the hope that eventually everyone gets all the messages
  • at least once delivery means that each message will be sent once or more; listeners are to expect to see all messages, but possibly more than once
  • at most once delivery means that each sender will only send the message one time. It's possible that listeners never see it.
  • exactly once delivery means that each message is guaranteed to be sent and seen only once. This is a nice theoretical objective but quite impossible in real systems. It ends up being simulated through other means (combining atomic broadcast with specific flags and ordering guarantees, for example)

Regarding ordering:

  • total order means that all messages have just one strict ordering and way to compare them, much like 3 is always greater than 2.
  • partial order means that some messages can compare with some messages, but not necessarily all of them. For example, I could decide that all the updates to the key k1 can be in a total order regarding each other, but independent from updates to the key k2. There is therefore a partial order between all updates across all keys, since k1 updates bear no information relative to the k2 updates.
  • causal order means that all messages that depend on other messages are received after these (you can't learn of a user's avatar before you learn about that user). It is a specific type of partial order.

There isn't a "best" ordering, each provides different possibilities and comes with different costs, optimizations, and related failure modes.

Idempotence

Idempotence is important enough to warrant its own entry. Idempotence means that when messages are seen more than once, resent or replayed, they don't impact the system differently than if they were sent just once.

Common strategies include having each message refer to previously seen messages, so that an ordering can be defined that prevents replaying older messages, or setting unique IDs (such as transaction IDs) coupled with a store that prevents replaying transactions, and so on.
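
A tiny sketch of the unique-ID strategy (the storage choice, an ETS set here, is illustrative):

    new_dedup_table() ->
        ets:new(seen_message_ids, [set, public]).

    handle_once(Seen, MsgId, SideEffect) ->
        %% insert_new/2 succeeds only the first time MsgId is recorded, so
        %% replays of the same message never re-run the side effect.
        case ets:insert_new(Seen, {MsgId}) of
            true  -> SideEffect();
            false -> already_processed
        end.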

See Idempotence is not a medical condition for a great read on it, with various related strategies.

State Machine Replication

This is a theoretical model by which, given the same initial state and the same operations applied in the same order (disregarding all kinds of non-determinism), all state machines will end up with the same result.

This model ends up being critical to most reliable systems out there, which tend to all try to replay all events to all subsystems in the same order, ensuring predictable data sets in all places.

This is generally done by picking a leader; all writes are done through the leader, and all the followers get a consistent replicated state of the system, allowing them to eventually become leaders or to fan-out their state to other actors.
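
A minimal sketch of the idea (the operation shapes are made up): any replica that folds the same log of operations, in the same order, over the same initial state ends up with the same state.

    apply_log(InitialState, Ops) ->
        lists:foldl(fun apply_op/2, InitialState, Ops).

    apply_op({put, Key, Value}, State) -> maps:put(Key, Value, State);
    apply_op({delete, Key}, State)     -> maps:remove(Key, State).

Every replica evaluating apply_log(#{}, [{put, a, 1}, {put, b, 2}, {delete, a}]) arrives at #{b => 2}, which is what makes replaying a leader-ordered log a sound replication mechanism.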

State-Based Replication

State-based replication can be conceptually simpler than state-machine replication, with the idea that if you only replicate the state, you get the state at the end!

The problem is that it is extremely hard to make this fast and efficient. If your state is terabytes large, you don't want to re-send it on every operation. Common approaches will include splitting, hashing, and bucketing of data to detect changes and only send the changed bits (think of rsync), merkle trees to detect changes, or the idea of a patch to source code.

Practical Matters

Here are a bunch of resources worth digging into for various system design elements.

End-to-End Argument in System Design

Foundational practical aspect of system design for distributed systems:

  • a message that is sent is not a message that is necessarily received by the other party
  • a message that is received by the other party is not necessarily a message that is actually read by the other party
  • a message that is read by the other party is not necessarily a message that has been acted on by the other party

The conclusion is that if you want anything to be reliable, you need an end-to-end acknowledgement, usually written by the application layer.

These ideas are behind the design of TCP as a protocol, but the authors also note that it wouldn't be sufficient to leave it at the protocol level; the application layer must be involved.
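
A small sketch of such an application-level acknowledgement (message shapes are made up): the sender only treats the work as done once the receiving application, not just the transport, confirms it.

    request(Server, Payload, Timeout) ->
        Ref = make_ref(),
        Server ! {do_work, self(), Ref, Payload},
        receive
            {done, Ref} -> ok           % the application itself confirmed it acted
        after Timeout ->
            {error, no_ack}             % transport-level delivery alone proves nothing
        end.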

Fallacies of Distributed Computing

The fallacies are:

  • The network is reliable
  • Latency is zero
  • Bandwidth is infinite
  • The network is secure
  • Topology doesn't change
  • There is one administrator
  • Transport cost is zero
  • The network is homogeneous

Partial explanations can be found on the Wikipedia page, or full ones in the paper.

Common Practical Failure Modes

In practice, when you switch from Computer Science to Engineering, the types of faults you will find are a bit more varied, but they can be mapped onto the theoretical models.

This section is an informal list of common sources of issues in a system. See also the CAP theorem checklist for other common cases.

Netsplit

Some nodes can talk to each other, but some nodes are unreachable from others. A common example is that a US-based network can communicate fine internally, as can an EU-based network, but the two are unable to speak to each other.

Asymmetric Netsplit

Communication between groups of nodes is not symmetric. For example, imagine that the US network can send messages to the EU network, but the EU network cannot respond back.

This is a rarer mode when using TCP (although it has happened before), and a potentially common one when using UDP.

Split Brain

The way a lot of systems deal with failures is to keep a majority going. A split brain is what happens when both sides of a netsplit think they are the leader and start making conflicting decisions.

Timeouts

Timeouts are particularly tricky because they are non-deterministic. They can only be observed from one end, and you never know if a timeout that is ultimately interpreted as a failure was actually a failure, or just a delay due to networking, hardware, or GC pauses.

There are times when retransmissions are not safe if the message has already been seen (i.e. it is not idempotent), and timeouts essentially make it impossible to know whether retransmission is safe to try: was the message acted on, dropped, or is it still in transit or sitting in a buffer somewhere?
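
A minimal sketch of the ambiguity (names are hypothetical): when the receive gives up, the caller cannot tell a dropped request apart from one that was processed but whose reply never made it back, so retrying is only safe if the request is idempotent.

```
defmodule AmbiguousCall do
  def call(server, request, timeout \\ 5_000) do
    ref = make_ref()
    send(server, {:request, ref, self(), request})

    receive do
      {:reply, ^ref, result} -> {:ok, result}
    after
      timeout ->
        # All we know is that no reply arrived in time: the request may have
        # been dropped, acted on, or may still be sitting in a queue somewhere.
        {:error, :timeout}
    end
  end
end
```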

Missing Messages due to Ordering

Generally, using TCP and handling crashes will tend to mean that few messages get missed across systems, but frequent cases include:

  • The node has gone down (or the software crashed) for a few seconds during which it missed a message that won't be repeated
  • The updates are received transitively across various nodes. For example, a message published by service A on a bus (whether Kafka or RMQ) can end up read, transformed, or acted on and re-published by service B, and there is a possibility that service C will read B's update before A's, causing causality issues

Clock Drift

Not all clocks on all systems are properly synchronized (even when using NTP), and they run at different speeds.

Using timestamps to sort events is almost guaranteed to be a source of bugs, even more so if the timestamps come from multiple computers.
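
A small made-up illustration: if one node's clock runs a few seconds slow, sorting by wall-clock timestamps reverses the causal order of two related events, and a naive last-write-wins policy would keep the wrong one.

```
events = [
  # Happened first, on node :b, whose clock is accurate.
  %{node: :b, ts: ~U[2024-01-26 12:00:05Z], op: :set_email},
  # Happened second, on node :a, whose clock runs three seconds slow.
  %{node: :a, ts: ~U[2024-01-26 12:00:03Z], op: :confirm_email}
]

# Sorting by timestamp puts :confirm_email before :set_email.
Enum.sort_by(events, & &1.ts, DateTime)
```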

The Client is Part of the System

A very common pitfall is to forget that the client that participates in a distributed system is part of it. Consistency on the server-side will not necessarily be worth much if the client can't make sense of the events or data it receives.

This is particularly insidious for database clients that run a non-idempotent transaction, time out, and have no way to know whether they can try it again.

Restoring from multiple backups

A single backup is fairly easy to handle. Multiple backups run into a problem called consistent cuts (high-level view) and distributed snapshots, which means that not all the backups are taken at the same point in time; this introduces inconsistencies that can be construed as corrupting data.

The good news is there's no great solution and everyone suffers the same.

Consistency Models

There are dozens of different levels of consistency, all of which are documented on Wikipedia, in Peter Bailis' paper on the topic, or overviewed in Kyle Kingsbury's post on them.

  • Linearizability means each operation appears atomic and could not have been impacted by another one, as if they all ran just one at a time. The order is known and deterministic, and a read that starts after a given write has completed will be able to see that data.
  • Serializability means that while all operations appear to be atomic, it makes no guarantee about which order they would have happened in. It means that some operations might start after another one and complete before it, and as long as the isolation is well-maintained, that isn't a problem.
  • Sequential consistency means that even if operations might have taken place out of order, they will appear as if they all happened in order.
  • Causal Consistency means that only operations that have a logical dependency on each other need to be ordered amongst each other.
  • Read-committed consistency means that any operation that has been committed is available for further reads in the system.
  • Repeatable reads means that within a transaction, reading the same value multiple times always yields the same result.
  • Read-your-writes consistency means that any write you have completed must be readable by the same client subsequently.
  • Eventual Consistency is a kind of special family of consistency measures that say that the system can be inconsistent as long as it eventually becomes consistent again. Causal consistency is an example of eventual consistency.
  • Strong Eventual Consistency is like eventual consistency but demands that no conflicts can happen between concurrent updates. This is usually the land of CRDTs.

Note that while these definitions have clear semantics that academics tend to respect, they are not adopted uniformly or respected in various projects' or vendors' documentation in the industry.

Database Transaction Scopes

By default, most people assume database transactions are linearizable, but they tend not to be, because that would be far too slow as a default.

Each database might have different semantics, so the following links cover the major ones.

Be aware that while the PostgreSQL documentation is likely the clearest and easiest to understand introduction to the topic, various vendors can assign different meanings to the same standard transaction scopes.

Logical Clocks

These are data structures that let you create either total or partial orderings between messages or state transitions.

The most common ones are:

  • Lamport timestamps, which are just a counter. They allow the silent, undetected overwriting of conflicting data.
  • Vector Clocks, which contain a counter per node, incremented on each message seen. They can detect conflicts in data and on operations (see the sketch after this list).
  • Version Vectors are like vector clocks, but only change the counters on state variations rather than on every event seen.
  • Dotted Version Vectors are fancy version vectors that allow tracking conflicts that would be perceived by a client talking to a server.
  • Interval Tree Clocks attempt to fix the issues of other clock types by requiring less space to store node-specific information and by allowing a kind of built-in garbage collection. They also have one of the nicest papers ever.
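
As a rough illustration of vector clocks (not any particular library's API), here is a minimal sketch with one counter per node, a merge that takes the per-node maximum, and a comparison that can report two clocks as concurrent, i.e. conflicting:

```
defmodule VClock do
  def new, do: %{}

  def increment(clock, node), do: Map.update(clock, node, 1, &(&1 + 1))

  def merge(a, b), do: Map.merge(a, b, fn _node, x, y -> max(x, y) end)

  def compare(a, b) do
    cond do
      a == b -> :equal
      dominates?(b, a) -> :before      # a happened before b
      dominates?(a, b) -> :after       # b happened before a
      true -> :concurrent              # independent, possibly conflicting updates
    end
  end

  # Every entry in b is covered by a's counters.
  defp dominates?(a, b), do: Enum.all?(b, fn {node, n} -> Map.get(a, node, 0) >= n end)
end
```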

CRDTs

CRDTs are essentially data structures that restrict the operations that can be performed so that they can never conflict, no matter in which order they are applied or how concurrently that takes place.

Think of it as the specification for how someone would write a distributed Redis that was never wrong, with only the maths left behind.

This is still an active area of research and countless papers and variations are always coming out.
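
One of the simplest examples is a grow-only counter (G-Counter), sketched below: each node only ever increments its own slot, and merging two replicas takes the per-node maximum, so merges commute and can never conflict regardless of the order in which they happen.

```
defmodule GCounter do
  def new, do: %{}

  # A node may only increment its own entry.
  def increment(counter, node), do: Map.update(counter, node, 1, &(&1 + 1))

  # Merging takes the per-node maximum; merge(a, b) == merge(b, a).
  def merge(a, b), do: Map.merge(a, b, fn _node, x, y -> max(x, y) end)

  def value(counter), do: counter |> Map.values() |> Enum.sum()
end
```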

Other interesting material

The bible for putting all of these views together is Designing Data-Intensive Applications by Martin Kleppmann. Be advised, however, that everyone I know who absolutely loves this book already had a good foundation in distributed systems from reading a bunch of papers, and greatly appreciated having it all put in one place. Most people I've seen read it in book clubs with the aim of getting better at distributed systems still found it challenging and confusing at times, and benefitted from having someone around to whom they could ask questions in order to bridge some gaps. It is still the clearest source I can imagine for everything in one place.

Permalink

Counting Forest Fires

2024/01/26

Counting Forest Fires

Today I'm hitting my 3-year mark at Honeycomb, and so I thought I'd re-publish one of my favorite short blog posts written over there, Counting Forest Fires, which has become my go-to argument when discussing incident metrics and being asked to count outages and incidents.

If you were asked to evaluate how good crews were at fighting forest fires, what metric would you use? Would you consider it a regression on your firefighters' part if you had more fires this year than the last? Would the size and impact of a forest fire be a measure of their success? Would you look for the cause—such as a person lighting it, an environmental factor, etc—and act on it? Chances are that yes, that's what you'd do.

As time has gone by, we've learned interesting things about forest fires. Smokey Bear can tell us to be careful all he wants, but sometimes there's nothing we can do about fires. Climate change creates conditions where fires are going to be more likely, intense, and uncontrollable. We constantly learn from indigenous approaches to fire management, and we now leverage prescribed burns instead of trying to prevent them all.

In short, there are limits to what individuals or teams on the ground can do, and while counting fires or their acreage can be useful to know the burden or impact they have, it isn't a legitimate measure of success. Knowing whether your firefighters or your prevention campaigns are useful can't rely on these high-level observations, because they'll be drowned in the noise of a messy, unpredictable world.

Forest fires and tech fires: turns out they're not so different

The parallel to software is obvious: there are conditions we put in place in organizations—or the whole industry—and things we do that have greater impacts than we can account for, or that can't be countered by individual teams or practitioner actions. And if you want to improve things, there are things you can measure that are going to be more useful. The type of metric I constantly argue for is to count the things you can do, not the things you hope don't happen. You hope that forest fires don't happen, but there's only so much that prevention can do. Likewise with incidents. You want to know that your response is adequate.

You want to know about risk factors so you can alter behavior in the immediate term, and about signals that tell you the way you run things needs to be adjusted in the long term, for sustainability. The goal isn't to prevent all incidents, because small ones here and there are useful to prevent even greater ones or to provide practice to your teams, enough that you may want to cause controlled incidents on purpose—that's chaos engineering. You want to be able to prioritize concurrent events so that you respond where it's most worth it. You want to prevent harm to your responders, and know how to limit it as much as possible for the people they serve. You want to make sure you learn from new observations and methods, and that your practice remains current with escalating challenges.

Don't look at success/failure. It goes deeper than that.

The concepts mentioned above are things you can invest in, train people for, and create conditions that can lead to success. Counting forest fires or incidents lets you estimate how bad a given season or quarter was, but it tells you almost nothing about how good of a job you did.

It tells you about challenges, about areas where you may want to invest and pay attention, and where you may need to repair and heal—more than it tells you about successes or failures. Fires are going to happen regardless, but they can be part of a healthy ecosystem. Trying to stamp them all out may do more harm than good in the long run.

The real question is: how do you want to react? And how do you measure that?

Permalink

Elixir v1.16 released

Elixir v1.16 has just been released. 🎉

The Elixir team continues improving the developer experience via tooling, documentation, and precise feedback, while keeping the language stable and compatible.

The notable improvements in this release are richer compiler diagnostics and extensive improvements to our docs in the form of guides, anti-patterns, diagrams and more.

Code snippets in diagnostics

Elixir v1.15 introduced a new compiler diagnostic format and the ability to print multiple error diagnostics per compilation (in addition to multiple warnings).

With Elixir v1.16, we also include code snippets in exceptions and diagnostics raised by the compiler, including ANSI coloring on supported terminals. For example, a syntax error now includes a pointer to where the error happened:

** (SyntaxError) invalid syntax found on lib/my_app.ex:1:17:
    error: syntax error before: '*'
    │
  1 │ [1, 2, 3, 4, 5, *]
    │                 ^
    │
    └─ lib/my_app.ex:1:17

For mismatched delimiters, it now shows both delimiters:

** (MismatchedDelimiterError) mismatched delimiter found on lib/my_app.ex:1:18:
    error: unexpected token: )
    │
  1 │ [1, 2, 3, 4, 5, 6)
    │ │                └ mismatched closing delimiter (expected "]")
    │ └ unclosed delimiter
    │
    └─ lib/my_app.ex:1:18

For unclosed delimiters, it now shows where the unclosed delimiter starts:

** (TokenMissingError) token missing on lib/my_app:8:23:
    error: missing terminator: )
    │
  1 │ my_numbers = (1, 2, 3, 4, 5, 6
    │              └ unclosed delimiter
 ...
  8 │ IO.inspect(my_numbers)
    │                       └ missing closing delimiter (expected ")")
    │
    └─ lib/my_app:8:23

Errors and warnings diagnostics also include code snippets. When possible, we will show precise spans, such as on undefined variables:

  error: undefined variable "unknown_var"
  │
5 │     a - unknown_var
  │         ^^^^^^^^^^^
  │
  └─ lib/sample.ex:5:9: Sample.foo/1

Otherwise the whole line is underlined:

error: function names should start with lowercase characters or underscore, invalid name CamelCase
  │
3 │   def CamelCase do
  │   ^^^^^^^^^^^^^^^^
  │
  └─ lib/sample.ex:3

A huge thank you to Vinícius Müller for working on the new diagnostics.

Revamped documentation

The ExDoc package provides Elixir developers with one of the most complete and robust documentation generators. It supports API references, tutorials, cheatsheets, and more.

However, because many of the language tutorials and reference documentation were written before ExDoc existed, they were maintained as part of the official website, separate from the language source code. With Elixir v1.16, we have moved our learning material to the language repository. This provides several benefits:

  1. Tutorials are versioned alongside their relevant Elixir version

  2. You get full-text search across all API reference and tutorials

  3. ExDoc will autolink module and function names in tutorials to their relevant API documentation

Another feature we have incorporated in this release is the addition of cheatsheets, starting with a cheatsheet for the Enum module. If you would like to contribute future cheatsheets to Elixir itself, feel free to start a discussion and collect feedback on the Elixir Forum.

Finally, we have started enriching our documentation with Mermaid.js diagrams. You can find examples in the GenServer and Supervisor docs.

Elixir has always been praised for its excellent documentation and we are glad to continue to raise the bar for the whole ecosystem.

Living anti-patterns reference

Elixir v1.16 incorporates and extends the work on Understanding Code Smells in Elixir Functional Language, by Lucas Vegi and Marco Tulio Valente, from ASERG/DCC/UFMG, into the official documentation in the form of anti-patterns. Our goal is to provide examples of potential pitfalls for library and application developers, with additional context and guidance on how to improve their codebases.

In earlier versions, Elixir’s official reference for library authors included a list of anti-patterns for library developers. Lucas Vegi and Marco Tulio Valente extended and refined this list based on the existing literature, articles, and community input (including feedback based on their prevalence in actual codebases).

To incorporate the anti-patterns into the language, we trimmed the list down to keep only anti-patterns which are unambiguous and actionable, and divided them into four categories: code-related, design-related, process-related, and meta-programming. Then we collected more community feedback during the release candidate period, further refining and removing unclear guidance.

We are quite happy with the current iteration of anti-patterns but this is just the beginning. As they become available to the whole community, we expect to receive more input, questions, and concerns. We will continue listening and improving, as our ultimate goal is to provide a live reference that reflects the practices of the ecosystem, rather than a document that is written in stone and ultimately gets out of date. A perfect example of this is the recent addition of “Sending unnecessary data” anti-pattern, which was contributed by the community and describes a pitfall that may happen across codebases.

Type system updates

As we get Elixir v1.16 out the door, the Elixir team will focus on bringing the initial core for set-theoretic types into the Elixir compiler, with the goal of running automated analysis in patterns and guards. This is the first step outlined in a previous article and is sponsored by Fresha (they are hiring!), Starfish* (they are hiring!), and Dashbit.

Learn more

Other notable changes in this release are:

  • the addition of String.replace_invalid/2, to help deal with invalid UTF-8 encoding

  • the addition of the :limit option in Task.yield_many/2 that limits the maximum number of tasks to yield

  • improved binary pattern matching by allowing prefix binary matches, such as <<^prefix::binary, rest::binary>> (see the sketch after this list)
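
A quick, hedged sketch of how these additions might be used (values are illustrative only):

```
# Invalid UTF-8 bytes are replaced, by default with "�":
String.replace_invalid(<<"hello", 0xFF, "world">>)
#=> "hello�world"

# Yield at most 3 of the tasks instead of all of them:
tasks = Enum.map(1..10, fn i -> Task.async(fn -> i * i end) end)
Task.yield_many(tasks, limit: 3, timeout: 1_000)

# Prefix binary match with a pinned variable:
prefix = "user:"
<<^prefix::binary, id::binary>> = "user:123"
id
#=> "123"
```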

For a complete list of all changes, see the full release notes.

Check the Install section to get Elixir installed and read our Getting Started guide to learn more.

Happy learning!

Permalink

Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.