Erlang/OTP 27.0 Release

Erlang/OTP 27 is a new major release with new features, improvements as well as a few incompatibilities.

For details about new features, bugfixes and potential incompatibilities see the Erlang 27.0 README or the Erlang/OTP 27.0 downloads page.

Many thanks to all contributors!

Below are some of the highlights of the release:

Documentation

EEP-59 has been implemented. Documentation attributes in source files can now be used to document functions, types, callbacks, and modules.

The entire Erlang/OTP documentation is now using the new documentation system.

Building Erlang/OTP

  • configure now automatically enables support for year-2038-safe timestamps.

    By default configure scripts used when building OTP will now try to enable support for timestamps that will work after mid-January 2038. This has typically only been an issue on 32-bit platforms.

    If configure cannot figure out how to enable such timestamps, it will abort with an error message. If you want to build the system anyway, knowing that the system will not function properly after mid-January 2038, you can pass the --disable-year2038 option to configure, which will enable configure to continue without support for timestamps after mid-January 2038.

New language features

  • Triple-Quoted Strings have been implemented as per EEP 64 to allow a string to encompass a complete paragraph.

  • Adjacent string literals without intervening white space are now a syntax error, to avoid possible confusion with triple-quoted strings.

  • Sigils on string literals (both ordinary and triple-quoted) have been implemented as per EEP 66. For example, ~"Björn" or ~b"Björn" are now equivalent to <<"Björn"/utf8>>.

Compiler and JIT improvements

  • The compiler will now merge consecutive updates of the same record.

  • Safe destructive update of tuples has been implemented in the compiler and runtime system. This allows the VM to update tuples in-place when it is safe to do so, thus improving performance by doing less copying but also by producing less garbage.

  • The maybe expression is now enabled by default, eliminating the need for enabling the maybe_expr feature.

  • Native coverage support has been implemented in the JIT. It will automatically be used by the cover tool to reduce the execution overhead when running cover-compiled code. There are also new APIs to support native coverage without using the cover tool.

  • The compiler will now raise a warning when updating record/map literals to catch a common mistake. For example, the compiler will now emit a warning for #r{a=1}#r{b=2}.

ERTS

  • The erl command now supports the -S flag, which is similar to the -run flag, but with some of the rough edges filed off.

  • By default, escripts will now be compiled instead of interpreted. That means that the compiler application must be installed.

  • The default process limit has been raised to 1048576 processes.

  • The erlang:system_monitor/2 functionality is now able to monitor long message queues in the system.

  • The obsolete and undocumented support for opening a port to an external resource by passing an atom (or a string) as first argument to open_port(), implemented by the vanilla driver, has been removed. This feature has been scheduled for removal in OTP 27 since the release of OTP 26.

  • The pid field has been removed from erlang:fun_info/1,2.

  • Multiple trace sessions are now supported.

STDLIB

  • There is a new module json for encoding and decoding JSON.

    Both encoding and decoding can be customized. Decoding can be done in a SAX-like fashion and handle multiple documents and streams of data.

    The new json module is used by the jer (JSON Encoding Rules) of ASN.1 for encoding and decoding JSON. Thus, there is no longer any need to supply an external JSON library. (A brief usage sketch follows after this list.)

  • Several new functions that accept funs have been added to module timer.

  • The functions is_equal/2, map/2, and filtermap/2 have been added to the modules sets, ordsets, and gb_sets.

  • There are new efficient ets traversal functions with guaranteed atomicity. For example, ets:next/2 followed by ets:lookup/2 can now be replaced with ets:next_lookup/2.

  • The new function ets:update_element/4 is similar to ets:update_element/3, but takes a default tuple as the fourth argument, which will be inserted if no previous record with that key exists.

  • binary:replace/3,4 now supports using a fun for supplying the replacement binary.

  • The new function proc_lib:set_label/1 can be used to add a descriptive term to any process that does not have a registered name. The name will be shown by tools such as c:i/0 and observer, and it will be included in crash reports produced by processes using gen_server, gen_statem, gen_event, and gen_fsm.

  • Added functions to retrieve the next higher or lower key/element from gb_trees and gb_sets, as well as returning iterators that start at given keys/elements.
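
To get a quick feel for the new json module, here is a minimal sketch. It is written as Elixir calling the Erlang module directly (the Erlang API is the same module and functions), and the sample data is made up for illustration.

# Encode a map to JSON iodata, then flatten it into a binary.
iodata = :json.encode(%{"name" => "Björn", "sigils" => true})
json = IO.iodata_to_binary(iodata)
# json is now something like ~s({"name":"Björn","sigils":true})

# Decode it back into a map with binary keys.
decoded = :json.decode(json)
# decoded is %{"name" => "Björn", "sigils" => true}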

common_test

  • Calls to ct:capture_start/0 and ct:capture_stop/0 are now synchronous to ensure that all output is captured.

  • The default CSS will now include a basic dark mode handling if it is preferred by the browser.

crypto

  • The functions crypto_dyn_iv_init/3 and crypto_dyn_iv_update/3 that were marked as deprecated in Erlang/OTP 25 have been removed.

dialyzer

  • The --gui option for Dialyzer has been removed.

ssl

  • The ssl client can negotiate and handle certificate status request (OCSP stapling support on the client side).

tools

  • There is a new tool tprof, which combines the functionality of eprof and cprof under one interface. It also adds heap profiling.

xmerl

  • As an alternative to xmerl_xml, a new export module xmerl_xml_indent that provides out-of-the-box indented output has been added.

For more details about new features and potential incompatibilities see the README.

Least Privilege And Elixir Ecto

What’s Least Privilege?

Before diving into the heart of this blog post, I think it’s wise to explain what I mean by “Least Privilege” and why it’s worthwhile. It’s also worth discussing why you may not want to bother with it.

The “Principle of Least Privilege” basically means you enable a user to do what they need to do but nothing more. If a user only needs to create reports against a database then it’s usually acceptable to give them permissions to query the database and nothing else. You surely don’t need to give a user who needs to create reports the ability to modify the privileges of other users; that’s unnecessary and might cause big problems.

Some of you may be old enough to remember a time when the Windows operating system was plagued by lots and lots of worms and viruses. This was for the simple reason that, before the Vista version of Windows, all users on the machine were effectively superusers (or root, depending on one’s frame of reference). This, of course, arose from the fact that Windows was originally intended to be used by a single user on a single machine, before connectivity was much of a concern. Why would you block someone from doing whatever they want on a machine if you anticipated that they’d be installing their own software? After all, PC did mean personal computer, right?

Of course, as time progressed, computers were networked and connected to the internet. I remember a less computer-literate friend of mine who simply connected to the internet before anyone had installed a firewall for him, and his PC was immediately infected with a virus. I installed a firewall on his machine and cleaned things up, but he asked me, “What did I do? I just checked my email.” I had to tell him that the simple act of connecting to the internet without a firewall present was enough to expose his machine to a virus.

Now if you’re 100% sure that your application will never ever be connected to the internet, then you don’t need to worry about least privilege. But the likelihood of no connection to the internet reached zero about 20 years ago, and today it’s even less likely, if that’s possible, that a machine will never connect to the internet. Of course we’ve got firewalls, and we’ve got routers to secure things between us and the internet. But even so, there are ways to hack machines which have firewalls. And once someone has gotten into your network, you don’t want them to be able to get at all the data without some sort of validation.

Bear in mind too that keeping people from seeing sensitive data is not the only concern you should have as a developer. If someone can hack your database (or your machine), they can use it to store illicit data. And if you’re like most developers, chances are you’ll store application permissions and IDs in the database itself. The wisdom of storing IDs in a database is questionable, but again, this is something we need to be aware of as developers.

I may have said “you might not want to bother with least privilege” but it’s hard for me to imagine a situation where it truly isn’t something you want to make the effort to implement.

Secured In The Application Or Secured At The Database?

Another thing that a developer must consider is exactly where he or she secures the data in the database. You can, of course, secure permissions in the application. In fact, this is a very common scenario; it’s usually effective and has some real advantages. It’s more likely that your developers will know the language the application is written in and therefore be better able to understand the implications of the security code. If you need to create new permissions, you can modify the source code and you don’t need to worry about the database per se.

The other option is security at the database level. This is a bit harder to create and maintain but it’s also a lot more secure. When security is enforced at the application level a knowledgeable user can circumvent the application (and its protections) and get at data which he or she should not have access to.

Securing at the database level may require more work, and it might require a team to have a DBA on staff. Of course, having a database specialist on a team isn’t ever a bad idea–at least in my experience. And even if you make the choice to secure the application via application code, it’s still very wise to be quite stingy with what you let users see. You’re much better off giving a user too few permissions than too many.

This also conforms more closely to the idea of Zero Trust Networks. That is, no one on the network is trustworthy–including your application.

I want to emphasize that this is not an either/or choice. It’s entirely possible to have some of your security in the application and some in the database. As long as you code your application to deal with the results of attempts to do things which the user is not permitted to do, you’re fine.

Having said all this, Elixir (and, as far as I know, Ruby on Rails) favors the security-in-the-application model. Again, there’s no reason to prefer one approach versus the other–it’s just the way that Elixir was designed to interact with Ecto. The ability to perform database migrations is a large, large advantage over the alternative, and making this possible at the database level would be quite tricky.

The main reason I want to secure things at the database level is security for a publicly available database. If a database is unlikely to be publicly accessible, then there’s considerably less one needs to worry about in terms of security. I’m working on an application which will host a database publicly, so my tendency to be a little paranoid has made me investigate how I can move my security to the database level.

Admittedly this can be a sort of belt and suspenders approach; that is, it’s a little redundant but it’s also more secure. It’s far closer to the underlying notion informing a zero-trust network approach.

Some Additional Preliminaries

1.) Given the prevalence of PostgreSQL in Elixir applications, the code in this post assumes you’re working with a PgSQL back-end. The ideas should work equally well with other databases, but figuring out the exact Data Definition Language and/or SQL needed is left as an exercise for the reader.

2.) For the purposes of this post, I’m creating a read-only (RO) user and a read-write (RW) user. There are certainly other combinations which can be created. The main point is to provide a user who has privileges tailored to the work it will need to do so that it cannot be exploited by bad actors to mess with your database.

3.) I’ve tested the RO and the RW user with various scenarios and no obvious issues have presented themselves. Be warned, though, that my testing has not been very extensive, since I’ve been creating this implementation for a personal project; bear that in mind.

With all that out of the way, let’s finally get to some code, shall we?

Creating the Users

Here’s the code I’ve used to create the RO and RW users.

# Create custom users for reading data and writing data
defmodule LeastPrivilege do
  @moduledoc """
  Create database users with the least privileges necessary to perform their tasks.
  """

  # Ecto.Adapters.SQL.query!/2 returns the adapter's result struct
  # (Postgrex.Result when using PostgreSQL).
  @spec create_nonadmin_user(String.t(), String.t()) :: Postgrex.Result.t()
  defp create_nonadmin_user(user_name, password) do
    Ecto.Adapters.SQL.query!(
      Pta.Repo,
      "CREATE USER #{user_name} WITH NOSUPERUSER NOCREATEDB NOCREATEROLE NOINHERIT LOGIN PASSWORD #{password};"
    )
  end

  @spec grant_table_privileges(String.t(), [String.t()], [String.t()]) :: [[Postgrex.Result.t()]]
  defp grant_table_privileges(user_name, privileges, tables) do
    for table <- tables do
      for privilege <- privileges do
        Ecto.Adapters.SQL.query!(
          Pta.Repo,
          "GRANT #{privilege} ON TABLE #{table} TO #{user_name};"
        )
      end
    end
  end

  @spec create_query_only_user() :: Postgrex.Result.t()
  def create_query_only_user do
    pta_query_password = "'" <> System.fetch_env!("PTA_QUERY_PASSWORD") <> "'"
    user_name = "pta_query"

    create_nonadmin_user(user_name, pta_query_password)
    grant_table_privileges(user_name, ["SELECT"], ["venues", "performances"])
    grant_table_privileges(user_name, ["ALL PRIVILEGES"], ["schema_migrations"])
    Ecto.Adapters.SQL.query!(Pta.Repo, "GRANT pg_read_all_data TO #{user_name};")
    Ecto.Adapters.SQL.query!(Pta.Repo, "GRANT CREATE ON SCHEMA public TO #{user_name};")
  end

  @spec create_readwrite_user() :: Postgrex.Result.t()
  def create_readwrite_user do
    pta_update_password = "'" <> System.fetch_env!("PTA_UPDATE_PASSWORD") <> "'"
    user_name = "pta_update"

    create_nonadmin_user(user_name, pta_update_password)

    grant_table_privileges(user_name, ["SELECT", "INSERT", "UPDATE", "DELETE"], [
      "venues",
      "performances"
    ])

    grant_table_privileges(user_name, ["ALL PRIVILEGES"], ["schema_migrations"])
    Ecto.Adapters.SQL.query!(Pta.Repo, "GRANT pg_write_all_data TO #{user_name};")
    Ecto.Adapters.SQL.query!(Pta.Repo, "GRANT CREATE ON SCHEMA public TO #{user_name};")
  end
end

LeastPrivilege.create_query_only_user()
LeastPrivilege.create_readwrite_user()

A few notes on this code:

1.) I’ve created this as a separate module, but I include it in my seeds.exs file. This is for the simple reason that if I need to re-initialize the database (via mix ecto.reset or a similar mechanism), I want to ensure that my special least-privileged users get created as well.

2.) You may notice that all of the functions return a query result (a Postgrex.Result struct when using the PostgreSQL adapter). This is the normal behavior of the Ecto.Adapters.SQL.query! function; in this case we’re not actually querying any tables, but the return value is nonetheless a query result.

3.) In this particular case I’m using two tables. I’m unaware of any reason to believe that these techniques wouldn’t scale up to far more tables and/or privileges.

4.) There are a few grants which might puzzle the reader:

grant_table_privileges(user_name, ["ALL PRIVILEGES"], ["schema_migrations"])
Ecto.Adapters.SQL.query!(Pta.Repo, "GRANT CREATE ON SCHEMA public TO #{user_name};")

These privileges are a concession to the necessity of working with Ecto. I haven’t researched why they’re needed, but I can say for certain that if I attempt to get away with not granting both privileges, I see errors from Ecto.

5.) The operations in this module must be run under a SUPERUSER connection to the database. That is, you cannot create the RW user and then log in as that user to create the RO user. Both must be created while connected as a SUPERUSER. After they’re created, you can use the least-privileged users (see below).

Using The Least Privileged Users

Finally, it’s not very difficult to actually use the least-privileged users in your code. For example, if you wanted to use the RW user as your default database user, you could modify your dev.exs config file in this fashion:

# Configure your database
config :pta, Pta.Repo,
  username: "pta_update",
  password: *** password ***,
  hostname: "localhost",
  database: "pta_dev",
  stacktrace: true,
  show_sensitive_data_on_connection_error: true,
  pool_size: 10
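
One small refinement, not from the original post: since the seeds module above already reads PTA_UPDATE_PASSWORD from the environment, the config can read the same variable instead of hard-coding the password. Here’s a minimal sketch for dev.exs (for a release you would typically put this in runtime.exs instead):

# Configure your database, pulling the password from the same environment
# variable that seeds.exs uses when creating the pta_update user.
# (import Config is already at the top of a generated config file.)
import Config

config :pta, Pta.Repo,
  username: "pta_update",
  password: System.fetch_env!("PTA_UPDATE_PASSWORD"),
  hostname: "localhost",
  database: "pta_dev",
  stacktrace: true,
  show_sensitive_data_on_connection_error: true,
  pool_size: 10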

I haven’t done the work yet to allow me to modify the user dynamically in the code. Again, since I haven’t needed it yet, I haven’t bothered with it.

I hope this blog post helps others who might want to modify their Elixir application to push the database security to the database layer.

Erlang/OTP 27.0 Release Candidate 3

OTP 27.0-rc3

Erlang/OTP 27.0-rc3 is the third and final release candidate before the OTP 27.0 release.

The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you. We encourage users to try it out and give us feedback either by creating an issue at https://github.com/erlang/otp/issues or by posting to Erlang Forums.

All artifacts for the release can be downloaded from the Erlang/OTP Github release and you can view the new documentation at https://erlang.org/documentation/doc-15.0-rc3/doc. You can also install the latest release using kerl like this:

kerl build 27.0-rc3 27.0-rc3

Erlang/OTP 27 is a new major release with new features, improvements as well as a few incompatibilities. Some of the new features are highlighted below.

Many thanks to all contributors!

Notable changes in RC3

  • The support for multiple trace sessions is now documented and ready for use.

Highlights for RC2

  • There is a new module json for encoding and decoding JSON.

    Both encoding and decoding can be customized. Decoding can be done in a SAX-like fashion and handle multiple documents and streams of data.

    The new json module is used by the jer (JSON Encoding Rules) of ASN.1 for encoding and decoding JSON. Thus, there is no longer any need to supply an external JSON library.

Other notable changes in RC2

  • The existing experimental support for archive files will be changed in a future release. The support for having an archive in an escript will remain, but the support for using archives in a release will either become more limited or completely removed.

    As of Erlang/OTP 27, the function code:lib_dir/2, the -code_path_choice flag, and using erl_prim_loader for reading members of an archive are deprecated.

    To remain compatible with future versions of Erlang/OTP, escript scripts that need to retrieve data files from their archive should use escript:extract/2 instead of erl_prim_loader and code:lib_dir/2.

  • The order in which the compiler looks up options has changed.

    When there is a conflict between the compiler options given in the -compile() attribute and options given to the compiler, the options given in the -compile() attribute override the options given to the compiler, which in turn override options given in the ERL_COMPILER_OPTIONS environment variable.

    Example:

    If some_module.erl has the following attribute:

    -compile([nowarn_missing_spec]).
    

    and the compiler is invoked like so:

    % erlc +warn_missing_spec some_module.erl
    

    no warnings will be issued for functions that do not have any specs.

  • configure now automatically enables support for year-2038-safe timestamps.

    By default configure scripts used when building OTP will now try to enable support for timestamps that will work after mid-January 2038. This has typically only been an issue on 32-bit platforms.

    If configure cannot figure out how to enable such timestamps, it will abort with an error message. If you want to build the system anyway, knowing that the system will not function properly after mid-January 2038, you can pass the --disable-year2038 option to configure, which will enable configure to continue without support for timestamps after mid-January 2038.

Highlights for RC1

Documentation

EEP-59 has been implemented. Documentation attributes in source files can now be used to document functions, types, callbacks, and modules.

The entire Erlang/OTP documentation is now using the new documentation system.

New language features

  • Triple-Quoted Strings have been implemented as per EEP 64 to allow a string to encompass a complete paragraph.

  • Adjacent string literals without intervening white space are now a syntax error, to avoid possible confusion with triple-quoted strings.

  • Sigils on string literals (both ordinary and triple-quoted) have been implemented as per EEP 66. For example, ~"Björn" or ~b"Björn" are now equivalent to <<"Björn"/utf8>>.

Compiler and JIT improvements

  • The compiler will now merge consecutive updates of the same record.

  • Safe destructive update of tuples has been implemented in the compiler and runtime system. This allows the VM to update tuples in-place when it is safe to do so, thus improving performance by doing less copying but also by producing less garbage.

  • The maybe expression is now enabled by default, eliminating the need for enabling the maybe_expr feature.

  • Native coverage support has been implemented in the JIT. It will automatically be used by the cover tool to reduce the execution overhead when running cover-compiled code. There are also new APIs to support native coverage without using the cover tool.

  • The compiler will now raise a warning when updating record/map literals to catch a common mistake. For example, the compiler will now emit a warning for #r{a=1}#r{b=2}.

ERTS

  • The erl command now supports the -S flag, which is similar to the -run flag, but with some of the rough edges filed off.

  • By default, escripts will now be compiled instead of interpreted. That means that the compiler application must be installed.

  • The default process limit has been raised to 1048576 processes.

  • The erlang:system_monitor/2 functionality is now able to monitor long message queues in the system.

  • The obsolete and undocumented support for opening a port to an external resource by passing an atom (or a string) as first argument to open_port(), implemented by the vanilla driver, has been removed. This feature has been scheduled for removal in OTP 27 since the release of OTP 26.

  • The pid field has been removed from erlang:fun_info/1,2.

  • Multiple trace sessions are now supported.

STDLIB

  • Several new functions that accept funs have been added to module timer.

  • The functions is_equal/2, map/2, and filtermap/2 have been added to the modules sets, ordsets, and gb_sets.

  • There are new efficient ets traversal functions with guaranteed atomicity. For example, ets:next/2 followed by ets:lookup/2 can now be replaced with ets:next_lookup/2.

  • The new function ets:update_element/4 is similar to ets:update_element/3, but takes a default tuple as the fourth argument, which will be inserted if no previous record with that key exists.

  • binary:replace/3,4 now supports using a fun for supplying the replacement binary.

  • The new function proc_lib:set_label/1 can be used to add a descriptive term to any process that does not have a registered name. The name will be shown by tools such as c:i/0 and observer, and it will be included in crash reports produced by processes using gen_server, gen_statem, gen_event, and gen_fsm.

  • Added functions to retrieve the next higher or lower key/element from gb_trees and gb_sets, as well as returning iterators that start at given keys/elements.

common_test

  • Calls to ct:capture_start/0 and ct:capture_stop/0 are now synchronous to ensure that all output is captured.

  • The default CSS will now include a basic dark mode handling if it is preferred by the browser.

crypto

  • The functions crypto_dyn_iv_init/3 and crypto_dyn_iv_update/3 that were marked as deprecated in Erlang/OTP 25 have been removed.

dialyzer

  • The --gui option for Dialyzer has been removed.

ssl

  • The ssl client can negotiate and handle certificate status request (OCSP stapling support on the client side).

tools

  • There is a new tool tprof, which combines the functionality of eprof and cprof under one interface. It also adds heap profiling.

xmerl

  • As an alternative to xmerl_xml, a new export module xmerl_xml_indent that provides out-of-the-box indented output has been added.

For more details about new features and potential incompatibilities see the README.

Erlang/OTP 27.0 Release Candidate 2

OTP 27.0-rc2

Erlang/OTP 27.0-rc2 is the second release candidate of three before the OTP 27.0 release.

The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you. We encourage users to try it out and give us feedback either by creating an issue at https://github.com/erlang/otp/issues or by posting to Erlang Forums.

All artifacts for the release can be downloaded from the Erlang/OTP Github release and you can view the new documentation at https://erlang.org/documentation/doc-15.0-rc2/doc. You can also install the latest release using kerl like this:

kerl build 27.0-rc2 27.0-rc2

Erlang/OTP 27 is a new major release with new features, improvements as well as a few incompatibilities. Some of the new features are highlighted below.

Many thanks to all contributors!

Highlights for RC2

  • There is a new module json for encoding and decoding JSON.

    Both encoding and decoding can be customized. Decoding can be done in a SAX-like fashion and handle multiple documents and streams of data.

    The new json module is used by the jer (JSON Encoding Rules) of ASN.1 for encoding and decoding JSON. Thus, there is no longer any need to supply an external JSON library.

Other notable changes in RC2

  • The existing experimental support for archive files will be changed in a future release. The support for having an archive in an escript will remain, but the support for using archives in a release will either become more limited or completely removed.

    As of Erlang/OTP 27, the function code:lib_dir/2, the -code_path_choice flag, and using erl_prim_loader for reading members of an archive are deprecated.

    To remain compatible with future versions of Erlang/OTP, escript scripts that need to retrieve data files from their archive should use escript:extract/2 instead of erl_prim_loader and code:lib_dir/2.

  • The order in which the compiler looks up options has changed.

    When there is a conflict between the compiler options given in the -compile() attribute and options given to the compiler, the options given in the -compile() attribute override the options given to the compiler, which in turn override options given in the ERL_COMPILER_OPTIONS environment variable.

    Example:

    If some_module.erl has the following attribute:

    -compile([nowarn_missing_spec]).
    

    and the compiler is invoked like so:

    % erlc +warn_missing_spec some_module.erl
    

    no warnings will be issued for functions that do not have any specs.

  • configure now automatically enables support for year-2038-safe timestamps.

    By default configure scripts used when building OTP will now try to enable support for timestamps that will work after mid-January 2038. This has typically only been an issue on 32-bit platforms.

    If configure cannot figure out how to enable such timestamps, it will abort with an error message. If you want to build the system anyway, knowing that the system will not function properly after mid-January 2038, you can pass the --disable-year2038 option to configure, which will enable configure to continue without support for timestamps after mid-January 2038.

Highlights for RC1

Documentation

EEP-59 has been implemented. Documentation attributes in source files can now be used to document functions, types, callbacks, and modules.

The entire Erlang/OTP documentation is now using the new documentation system.

New language features

  • Triple-Quoted Strings have been implemented as per EEP 64 to allow a string to encompass a complete paragraph.

  • Adjacent string literals without intervening white space are now a syntax error, to avoid possible confusion with triple-quoted strings.

  • Sigils on string literals (both ordinary and triple-quoted) have been implemented as per EEP 66. For example, ~"Björn" or ~b"Björn" are now equivalent to <<"Björn"/utf8>>.

Compiler and JIT improvements

  • The compiler will now merge consecutive updates of the same record.

  • Safe destructive update of tuples has been implemented in the compiler and runtime system. This allows the VM to update tuples in-place when it is safe to do so, thus improving performance by doing less copying but also by producing less garbage.

  • The maybe expression is now enabled by default, eliminating the need for enabling the maybe_expr feature.

  • Native coverage support has been implemented in the JIT. It will automatically be used by the cover tool to reduce the execution overhead when running cover-compiled code. There are also new APIs to support native coverage without using the cover tool.

  • The compiler will now raise a warning when updating record/map literals to catch a common mistake. For example, the compiler will now emit a warning for #r{a=1}#r{b=2}.

ERTS

  • The erl command now supports the -S flag, which is similar to the -run flag, but with some of the rough edges filed off.

  • By default, escripts will now be compiled instead of interpreted. That means that the compiler application must be installed.

  • The default process limit has been raised to 1048576 processes.

  • The erlang:system_monitor/2 functionality is now able to monitor long message queues in the system.

  • The obsolete and undocumented support for opening a port to an external resource by passing an atom (or a string) as first argument to open_port(), implemented by the vanilla driver, has been removed. This feature has been scheduled for removal in OTP 27 since the release of OTP 26.

  • The pid field has been removed from erlang:fun_info/1,2.

  • Multiple trace sessions are now supported.

STDLIB

  • Several new functions that accept funs have been added to module timer.

  • The functions is_equal/2, map/2, and filtermap/2 have been added to the modules sets, ordsets, and gb_sets.

  • There are new efficient ets traversal functions with guaranteed atomicity. For example, ets:next/2 followed by ets:lookup/2 can now be replaced with ets:next_lookup/2.

  • The new function ets:update_element/4 is similar to ets:update_element/3, but takes a default tuple as the fourth argument, which will be inserted if no previous record with that key exists.

  • binary:replace/3,4 now supports using a fun for supplying the replacement binary.

  • The new function proc_lib:set_label/1 can be used to add a descriptive term to any process that does not have a registered name. The name will be shown by tools such as c:i/0 and observer, and it will be included in crash reports produced by processes using gen_server, gen_statem, gen_event, and gen_fsm.

  • Added functions to retrieve the next higher or lower key/element from gb_trees and gb_sets, as well as returning iterators that start at given keys/elements.

common_test

  • Calls to ct:capture_start/0 and ct:capture_stop/0 are now synchronous to ensure that all output is captured.

  • The default CSS will now include a basic dark mode handling if it is preferred by the browser.

crypto

  • The functions crypto_dyn_iv_init/3 and crypto_dyn_iv_update/3 that were marked as deprecated in Erlang/OTP 25 have been removed.

dialyzer

  • The --gui option for Dialyzer has been removed.

ssl

  • The ssl client can negotiate and handle certificate status request (OCSP stapling support on the client side).

tools

  • There is a new tool tprof, which combines the functionality of eprof and cprof under one interface. It also adds heap profiling.

xmerl

  • As an alternative to xmerl_xml, a new export module xmerl_xml_indent that provides out-of-the-box indented output has been added.

For more details about new features and potential incompatibilities see the README.

A Commentary on Defining Observability

2024/03/19

Recently, Hazel Weakly has published a great article titled Redefining Observability. In it, she covers competing classical definitions of observability, weaknesses they have, and offers a practical reframing of the concept in the context of software organizations (well, not only software organizations, but the examples tilt that way).

I agree with her post in most ways, and so this blog post of mine is more of an improv-like “yes, and…” response to it, in which I also try to frame the many existing models as complementary or contrasting perspectives of a general concept.

The main points I’ll try to bring here are on the topics of the difference between insights and questions, the difference between observability and data availability, reinforcing a socio-technical definition, the mess of complex systems and mapping them, and finally, a hot take on the use of models when reasoning about systems.

Insights and Questions

The control theory definition of observability, from Rudolf E. Kálmán, goes as follows:

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

The one from Woods and Hollnagel in cognitive engineering goes like this:

Observability is feedback that provides insight into a process and refers to the work needed to extract meaning from available data.

Hazel’s version, by comparison, is:

Observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.

While all three definitions relate to extracting information from a system, Hazel’s definition sits at a higher level by specifically mentioning questions and actions. It’s a more complete feedback loop including some people, their mental models, and seeking to enrich them or use them.

I think that higher level ends up erasing a property of observability in the other definitions: it doesn’t have to be inquisitive nor analytical.

Let’s take a washing machine, for example. You can know whether it is being filled or spinning by sound. The lack of sound itself can be a signal about whether it is running or not. If it is overloaded, it might shake a lot and sound out of balance during the spin cycle. You don’t necessarily have to be in the same room as the washing machine nor paying attention to it to know things about its state, passively create an understanding of normalcy, and learn about some anomalies in there.

Another example here would be something as simple as a book. If you’re reading a good old paper book, you know you’re nearing the end of it just by how the pages you have read make a thicker portion of the book than those you haven’t read yet. You do not have to think about it, the information is inherent to the medium. An ebook read on an electronic device, however, will hide that information unless a design decision is made to show how many lines or words have been read, display a percentage, or a time estimate of the content left.

Observability for the ebook isn’t innate to its structure and must be built in deliberately. Similarly, you could know old PCs were doing heavy work if the disk was making noise and when the fan was spinning up; it is not possible to know as much on a phone or laptop that has an SSD and no fan unless someone builds a way to expose this data.

Associations and patterns can be formed by the observer in a way that provides information and explanations, leading to effective action and management of the system in play. It isn’t something always designed or done on purpose, but it may need to be.

The key element is that an insight can be obtained without asking questions. In fact, a lot of anomaly detection is done passively, by the observer having a sort of mental construct of what normal is that lets them figure out what should happen next—what the future trajectory of the system is—and to then start asking questions when these expectations are not met. The insights, therefore, can come before the question is asked. Observability can be described as a mechanism behind this.

I don’t think that this means Hazel’s definition is wrong; I think it might just be a consequence of the scale at which her definition operates. However, this distinction is useful for a segue into the difference between data availability and observability.

The difference between observability and data availability

For a visual example, I’ll use variations on a painting from 1866 named In a Roman Osteria, by Danish painter Carl Bloch:

[Image: two versions of the painting. On the left, a rough black-and-white hand-drawn outline showing three characters in the back and a cat on a bench next to the three main characters eating a meal; on the right, the original image cut up into shuffled jigsaw puzzle pieces.]

The first one is an outline, and the second is a jigsaw puzzle version (with all the pieces right side up and correctly oriented, at least).

The jigsaw puzzle has 100% data availability. All of the information is there and you can fully reconstruct the initial painting. The outlined version has a lot less data available, but if you’ve never seen the painting before, you will get a better understanding from it in absolutely no time compared to the jigsaw.

This “make the jigsaw show you what you need faster” approach is more or less where a lot of observability vendors operate: the data is out there; you need help to store it, put it together, and extract what’s relevant from it:

[Image: the original In a Roman Osteria painting.]

What this example highlights though is that while you may get better answers with richer and more accurate data (given enough time, tools, and skill), the outline is simpler and may provide adequate information with less effort required from the observer. Effective selection of data, presented better, may be more appropriate during high-tempo, tense situations where quick decisions can make a difference.

This, at least, implies that observability is not purely a data problem nor a tool problem (which lines up, again, with what Hazel said in her post). However, it hints at it being a potential design problem. The way data is presented, the way affordances are added, and whether the system is structured in a way that makes storytelling possible all can have a deep impact in how observable it turns out to be.

Sometimes, coworkers mention that some of our services are really hard to interpret even when using Honeycomb (which we build and operate!). My theory about that is that too often, the data we output for observability is structured with the assumption that the person looking at it will have the code on hand—as the author did when writing it—and will be able to map telemetry data to specific areas of code.

So when you’re in there writing queries and you don’t know much about the service, the traces mean little. As a coping mechanism, social patterns emerge where data that is generally useful is kept on some specific spans that are considered important, but that you can only find if someone more familiar with this area has already explained where it is to you. It draws on pre-existing knowledge of the architecture, of communication patterns, and of goals of the application that do not live within the instrumentation.

Traces that are easier to understand and explore make use of patterns that are meaningful to the investigator, regardless of their understanding of the code. For the more “understandable” telemetry data, the naming, structure, and level of detail are more related to how the information is to be presented than the structure of the underlying implementation.

Observability requires interpretation, and interpretation sits in the observer. What is useful or not will be really hard to predict, and people may find patterns and affordances in places that weren’t expected or designed, but still significant. Catering to this property requires taking a perspective of the system that is socio-technical.

The System is Socio-Technical

Once again for this section, I agree with Hazel on the importance of people in the process. She has lots of examples of good questions that exemplify this. I just want to push even harder here.

Most of the examples I’ve given so far were technical: machines and objects whose interpretation is done by humans. Real complex systems don’t limit themselves to technical components being looked at; people are involved, talking to each other, making decisions, and steering the overall system around.

This is nicely represented by the STELLA Report’s Above/Below the line diagram:

[Figure 4 from the STELLA Report: the above-the-line/below-the-line representation of the system.]

The continued success of the overall system does not purely depend on the robustness of technical components and their ability to withstand challenges for which they were designed. When challenges beyond what was planned for do happen, and when the context in which the system operates changes (whether it is due to competition, legal frameworks, pandemics, or evolving user needs and tastes), adjustments need to be made to adapt the system and keep it working.

The adaptation is not done purely on a technical level, by fixing and changing the software and hardware, but also by reconfiguring the organization, by people learning new things, by getting new or different people in the room, by reframing the situation, and by steering things in a new direction. There is a constant gap to bridge between a solution and its context, and the ability to anticipate these challenges, prepare for them, and react to them can be informed by observability.

Observability at the technical level (“instrument the code and look at it with tools”) is covered by all definitions of observability in play here, but I want to really point out that observability can go further.

If you reframe your system as properly socio-technical, then yes, you will need technical observability interpreted at the social level. But you may also need social observability handled at the social level: are employees burning out? Do we have the psychological safety required to learn from events? Do I have silos of knowledge that render my organization brittle? What are people working on? Where is the market at right now? Are our users leaving us for the competition? Are our employees leaving us for the competition? How do we deal with a fast-moving space with limited resources?

There are so many ways for an organization to fail that aren’t technical, and ideally we’d also keep an eye on them. A definition of observability that is technical in nature can set artificial boundaries to your efforts to gain insights from ongoing processes. I believe Hazel’s definition maps to this ideal more clearly than the cognitive engineering one, but I want to re-state the need to avoid strictly framing its application to technical components observed by people.

A specific dynamic I haven’t mentioned here—and this is something disciplines like cybernetics, cognitive engineering, and resilience engineering all have an interest in—is one where the technical elements of the system know about the social elements of the system. We essentially do not currently have automation (or AI) sophisticated enough to be good team members.

For example, while I can detect a coworker is busy managing a major outage or meeting with important customers in the midst of a contract renewal, pretty much no alerting system will be able to find that information and decide to ask for assistance from someone else who isn’t as busy working on high-priority stuff within the system. The ability of one agent to shape their own work based on broader objectives than their private ones is something that requires being able to observe other agents in the system, map that to higher goals, and shape their own behaviour accordingly.

Ultimately, a lot of decisions are made through attempts at steering the system or part of it in a given direction. This needs some sort of [mental] map of the relationships in the system, and an even harder thing to map out is the impact that having this information will have on the system itself.

Complex Systems Mapping Themselves

Recently I was at work trying to map concepts about reliability, and came up with this mess of a diagram showing just a tiny portion of what I think goes into influencing system reliability (the details are unimportant):

[Image: a very messy set of bubbles and arrows, provided with no explanations.]

In this concept map, very few things are purely technical; lots of work is social, process-driven, and is about providing feedback loops. As the system grows more complex, analysis and control lose some power, and sense-making and influence become more applicable.

The overall system becomes unknowable, and nearly impossible to map out—by the time the map is complete, it’s already outdated. On top of that, the moment the above map becomes used to make decisions, its own influence might need to become part of itself, since it has entered the feedback loop of how decisions are made. These things can’t be planned out, and sometimes can only be discovered in a timely manner by acting.

Basically, the point here is that not everything is observable via data availability and search. Some questions you have can only be answered by changing the system, either through adding new data, or by extracting the data through probing of the system. Try a bunch of things and look at the consequences.

A few years ago, I was talking with David Woods (to be exact, he was telling me useful things and I was listening) and he compared complex socio-technical systems to a messy web; everything is somehow connected to everything, and it’s nearly impossible to just keep track of all the relevant connections in your head. Things change and some elements will be more relevant today than they were yesterday. As we walk the web, we rediscover connections that are important, some that stopped being so significant, and so on.

Experimental practices like chaos engineering or fault injection aren’t just about testing behaviour for success and failure, they are also about deliberately exploring the connections and parts of the web we don’t venture into as often as we’d need to in order to maintain a solid understanding of it.

One thing to keep in mind is that the choice of which experiment to run is also based on the existing map and understanding of situations and failures that might happen. There is a risk in the planners and decision-makers not considering themselves to be part of the system they are studying, and of ignoring their own impact and influence.

This leads to elements such as pressures, goal conflicts, and adaptations to them, which may tend to only become visible during incidents. The framing of what to investigate, how to investigate it, how errors are constructed, and which questions are worth asking or not worth asking all participate in the weird, complex feedback loop within the big messy systems we’re in. The tools required for that level of analysis are, however, very, very different from what most observability vendors provide, and are generally never marketed as such, which does tie back to Hazel’s conclusion that “observability is organizational learning.”

A Hot Take on the Use of Models

Our complex socio-technical systems aren’t closed systems. Competitors exist; employees bring in their personal life into their work life; the environment and climate in which we live plays a role. It’s all intractable, but simplified models help make bits of it all manageable, at least partially. A key criterion is knowing when a model is worth using and when it is insufficient.

Hazel Weakly again hints at this when she states:

  • The control theory version gives you a way to know whether a system is observable or not, but it ignores the people and gives you no way to get there
  • The cognitive engineering version is better, but doesn’t give you a “why” you should care, nor any idea of where you are and where to go
  • Her version provides a motivation and a sense of direction as a process

I don’t think these versions are in conflict. They are models, and models have limits and contextual uses. In general I’d like to reframe these models as:

  • What data may be critical to provide from a technical component’s point of view (control theory model)
  • How people may process the data and find significance in it (cognitive engineering model)
  • How organizations should harness the mechanism as a feedback loop to learn and improve (Hazel Weakly’s model)

They work on different concerns by picking a different area of focus, and therefore highlight different parts of the overall system. It’s a bit like how looking at phenomena at a human scale with your own eyes, at a micro scale with a microscope, and at an astronomical scale with a telescope, all provide you with useful information despite operating at different levels. While the astronomical scale tools may not bear tons of relevance at the microscopic scale operations, they can all be part of the same overall search of understanding.

Much like observability can be improved despite having less data if it is structured properly, a few simpler models can let you make better decisions in the proper context.

My hope here was not to invalidate anything Hazel posted, but to keep the validity and specificity of the other models through additional contextualization.

Scaling a streaming service to hundreds of thousands of concurrent viewers at Veeps

Welcome to our series of case studies about companies using Elixir in production.

Veeps is a streaming service that offers direct access to live and on-demand events by award-winning artists at the most iconic venues. Founded in 2018, it became part of Live Nation Entertainment in 2021.

Veeps has been named one of the ten most innovative companies in music and nominated for an Emmy. It currently holds the Guinness World Record for the world’s largest ticketed livestream by a solo male artist—a performance where Elixir and Phoenix played an important role in the backend during the streaming.

This case study examines how Elixir drove Veeps’ technical transformation, surpassing high-scale demands while keeping the development team engaged and productive.

The challenge: scaling to hundreds of thousands of simultaneous users

Imagine you are tasked with building a system that can livestream a music concert to hundreds of thousands of viewers around the world at the same time.

In some cases, users must purchase a ticket before the concert can be accessed. For a famous artist, it’s not uncommon to see thousands of fans continuously refreshing their browsers and attempting to buy tickets within the first few minutes of the announcement.

The Veeps engineering team needed to handle both challenges.

Early on, the Veeps backend was implemented in Ruby on Rails. Its first version could handle a few thousand simultaneous users watching a concert without any impact on stream quality, which was fine with a handful of shows but would be insufficient given the expected show load and massive increase in concurrent viewership across streams. It was around that time that Vincent Franco joined Veeps as their CTO.

Vincent had an extensive background in building and maintaining ticketing and event management software at scale. So, he used that experience to further improve the system to handle tens of thousands of concurrent users. However, it became clear that improving it to hundreds of thousands would be a difficult challenge, requiring substantial engineering efforts and increased operational costs. The team began evaluating other stacks that could provide the out-of-the-box tooling for scaling in order to reach both short and long-term goals.

Adopting Elixir, hiring, and rewriting the system

Vincent, who had successfully deployed Elixir as part of high-volume systems in the past, believed Elixir was an excellent fit for Veeps’ requirements.

Backed by his experience and several case studies from the Elixir community, such as the one from Discord, Vincent convinced management that Elixir could address their immediate scaling needs and become a reliable foundation on which the company could build.

With buy-in from management, the plan was set in motion. They had two outstanding goals:

  • Prepare the platform to welcome the most famous artists in the world.
  • Build their own team of engineers to help innovate and evolve the product.

Vincent knew that hiring right-fit technical people can take time and he didn’t want to rush the process. Hence, he hired DockYard to rebuild the system while simultaneously searching for the right candidates to build out the team.

Eight months later, the system had been entirely rewritten in Elixir and Phoenix. Phoenix Channels were used to enrich the live concert experience, while Phoenix LiveView empowered the ticket shopping journey.

The rewrite was put to the test shortly after with a livestream that remains one of Veeps’ biggest, still to this day. Before the rewrite, 20 Rails nodes were used during big events, whereas now, the same service requires only 2 Elixir nodes. And the new platform was able to handle 83x more concurrent users than the previous system.

The increase in infrastructure efficiency significantly reduced the need for complex auto-scaling solutions while providing ample capacity to handle high-traffic spikes.

The rewrite marked the most extensive and intricate system migration in my career, and yet, it was also the smoothest.

- Vincent Franco, CTO

This was a big testament to Elixir and Phoenix’s scalability and gave the team confidence that they made the right choice.

By the time the migration was completed, Veeps had also assembled an incredibly talented team of two backend and two frontend engineers, which continued to expand and grow the product.

Perceived benefits of using Elixir and its ecosystem

After using Elixir for more than two years, Veeps has experienced significant benefits. Here are a few of them.

Architectural simplicity

Different parts of the Veeps system have different scalability requirements. For instance, when streaming a show, the backend receives metadata from users’ devices every 30 seconds to track viewership. This is the so-called Beaconing service.

Say you have 250,000 people watching a concert: the Beaconing service needs to handle thousands of requests per second for a few hours at a time. As a result, it needs to scale differently from other parts of the system, such as the merchandise e-commerce or backstage management.

To tackle this issue, they built a distributed system. They packaged each subsystem as an Elixir release, totaling five releases. For the communication layer, they used distributed Erlang, which is built into Erlang/OTP, allowing seamless inter-process communication across networked nodes.

In a nutshell, each node contains several processes with specific responsibilities. Each of these processes belongs to their respective distributed process group. If node A needs billing information, it will reach out to any process within the “billing process group”, which may be anywhere in the cluster.

When deploying a new version of the system, they deploy a new cluster altogether, with all five subsystems at once. Given Elixir’s scalability, the whole system uses 9 nodes, making a simple deployment strategy affordable and practical. As we will see, this approach is well-supported during development too, thanks to the use of Umbrella Projects.
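
The post does not show the process-group mechanics, but a minimal sketch of the “ask any member of a group, wherever it lives in the cluster” pattern using Erlang’s built-in :pg module might look like the following. The module names, the :billing group, and the invoice call are invented for illustration, and :pg’s default scope must be started somewhere in the supervision tree (for example with %{id: :pg, start: {:pg, :start_link, []}}).

defmodule Billing.Worker do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok) do
    # Join the distributed "billing" process group on startup.
    :ok = :pg.join(:billing, self())
    {:ok, %{}}
  end

  @impl true
  def handle_call({:invoice, order_id}, _from, state) do
    # Do the actual billing work here.
    {:reply, {:ok, order_id}, state}
  end
end

defmodule Billing do
  # Any node can call this; :pg returns members from across the cluster.
  def invoice(order_id) do
    case :pg.get_members(:billing) do
      [] -> {:error, :no_billing_workers}
      members -> GenServer.call(Enum.random(members), {:invoice, order_id})
    end
  end
end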

Service-oriented architecture within a monorepo

Although they run a distributed system, they organize the code in only one repository, following the monorepo approach. To do that, they use the Umbrella Project feature from Mix, the build tool that ships with Elixir.

Their umbrella project consists of 16 applications (at the time of writing), which they slice into five OTP releases. The remaining applications contain code that is shared between multiple subsystems. For example, one of the shared applications defines all the structs sent as messages across the subsystems, guaranteeing that all subsystems use the same schemas for the data they exchange.
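
As a rough illustration of how that sharing works (the app and module names here are made up, not taken from Veeps’ repository), an umbrella child can declare a dependency on a sibling application with Mix’s in_umbrella option:

    # apps/billing/mix.exs
    defmodule Billing.MixProject do
      use Mix.Project

      def project do
        [
          app: :billing,
          version: "0.1.0",
          build_path: "../../_build",
          config_path: "../../config/config.exs",
          deps_path: "../../deps",
          lockfile: "../../mix.lock",
          deps: deps()
        ]
      end

      defp deps do
        [
          # The shared structs/schemas app is a sibling in the same umbrella.
          {:shared_messages, in_umbrella: true}
        ]
      end
    end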

With umbrella projects, you can have the developer experience benefits of a single code repository, while being able to build a service-oriented architecture.

- Andrea Leopardi, Principal Engineer

Reducing complexity with the Erlang/Elixir toolbox

Veeps has an e-commerce platform that allows concert viewers to purchase artist merchandise. In e-commerce, a common concept is the shopping cart. Veeps models each shopping cart as a GenServer, a lightweight process managed by the Erlang VM.

This decision made it easier for them to implement other business requirements, such as locking the cart during payments and shopping cart expiration. Since each cart is a process, the expiration is as simple as sending a message to a cart process based on a timer, which is easy to do using GenServers.
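
A minimal sketch of what such a cart process might look like, assuming a Registry named CartRegistry and a 30-minute expiration (both made-up details for illustration):

    defmodule Cart do
      use GenServer

      @ttl :timer.minutes(30)

      def start_link(cart_id),
        do: GenServer.start_link(__MODULE__, cart_id, name: via(cart_id))

      def add_item(cart_id, item),
        do: GenServer.call(via(cart_id), {:add_item, item})

      @impl true
      def init(cart_id) do
        # Schedule expiration as a plain message to ourselves.
        timer = Process.send_after(self(), :expire, @ttl)
        {:ok, %{id: cart_id, items: [], timer: timer}}
      end

      @impl true
      def handle_call({:add_item, item}, _from, state) do
        # Re-arm the expiration timer on activity.
        Process.cancel_timer(state.timer)
        timer = Process.send_after(self(), :expire, @ttl)
        {:reply, :ok, %{state | items: [item | state.items], timer: timer}}
      end

      @impl true
      def handle_info(:expire, state) do
        # The cart simply goes away when it expires.
        {:stop, :normal, state}
      end

      # Registry-based naming; assumes a Registry named CartRegistry is supervised.
      defp via(cart_id), do: {:via, Registry, {CartRegistry, cart_id}}
    end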

For caching, the team relies on ETS (Erlang Term Storage), a high-performance key-value store that is part of the Erlang standard library. For cache busting between the different parts of the distributed system, they use Phoenix PubSub, a real-time publisher/subscriber library that comes with Phoenix.
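
Here is a hedged sketch of that combination: an ETS table for fast local reads, with Phoenix PubSub broadcasting invalidations to every node. The PubSub name (Veeps.PubSub) and table name are assumptions for the example, not Veeps’ actual setup.

    defmodule Cache do
      use GenServer

      @table :app_cache
      @topic "cache:invalidation"

      def start_link(_opts), do: GenServer.start_link(__MODULE__, [], name: __MODULE__)

      def get(key) do
        case :ets.lookup(@table, key) do
          [{^key, value}] -> {:ok, value}
          [] -> :miss
        end
      end

      def put(key, value), do: :ets.insert(@table, {key, value})

      # Bust the key locally and on every other node in the cluster.
      def invalidate(key) do
        Phoenix.PubSub.broadcast(Veeps.PubSub, @topic, {:invalidate, key})
      end

      @impl true
      def init(_) do
        :ets.new(@table, [:named_table, :public, :set, read_concurrency: true])
        Phoenix.PubSub.subscribe(Veeps.PubSub, @topic)
        {:ok, %{}}
      end

      @impl true
      def handle_info({:invalidate, key}, state) do
        :ets.delete(@table, key)
        {:noreply, state}
      end
    end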

Before the rewrite, the Beaconing service used Google’s Firebase. Now, the system uses Broadway, an Elixir library for building concurrent data ingestion and processing pipelines, to ingest data from hundreds of thousands of HTTP requests from concurrent users. Broadway’s built-in batching let them regulate batch sizes when sending requests to AWS services, and its built-in rate limiting kept them within AWS service constraints.
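
The shape of such a pipeline might look roughly like the sketch below. The producer module, concurrency numbers, and limits are placeholders, not Veeps’ actual configuration.

    defmodule Beacons.Pipeline do
      use Broadway

      def start_link(_opts) do
        Broadway.start_link(__MODULE__,
          name: __MODULE__,
          producer: [
            # Any Broadway producer works here (SQS, a custom one fed by the
            # HTTP layer, etc.); BeaconProducer is a placeholder.
            module: {BeaconProducer, []},
            rate_limiting: [allowed_messages: 5_000, interval: 1_000]
          ],
          processors: [default: [concurrency: 50]],
          batchers: [
            aws: [batch_size: 500, batch_timeout: 2_000, concurrency: 10]
          ]
        )
      end

      @impl true
      def handle_message(_processor, message, _context) do
        message
        |> Broadway.Message.update_data(&decode_beacon/1)
        |> Broadway.Message.put_batcher(:aws)
      end

      @impl true
      def handle_batch(:aws, messages, _batch_info, _context) do
        # One bulk request per batch keeps us within AWS request limits.
        :ok = ship_to_aws(Enum.map(messages, & &1.data))
        messages
      end

      defp decode_beacon(raw), do: raw
      defp ship_to_aws(_beacons), do: :ok
    end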

Finally, they use Oban, an Elixir library for background jobs, for all sorts of background-work use cases.
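
For completeness, a background job in Oban is just a worker module like the following. The worker name, queue, and the Mailer call are illustrative placeholders.

    defmodule Workers.SendReceipt do
      use Oban.Worker, queue: :emails, max_attempts: 5

      @impl Oban.Worker
      def perform(%Oban.Job{args: %{"order_id" => order_id}}) do
        # Returning :ok marks the job as completed; an error or raise
        # triggers a retry with backoff.
        Mailer.deliver_receipt(order_id)
      end
    end

    # Enqueueing a job from anywhere in the code:
    # %{order_id: 123} |> Workers.SendReceipt.new() |> Oban.insert()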

Throughout the development journey, Veeps consistently found that Elixir and its ecosystem had built-in solutions for their technical challenges. Here’s what Vincent, CTO of Veeps, had to say about that:

Throughout my career, I’ve worked with large-scale systems at several companies. However, at Veeps, it’s unique because we achieve this scale with minimal reliance on external tools. It’s primarily just Elixir and its ecosystem that empower us.

- Vincent Franco, CTO

This operational simplicity benefitted not only the production environment but also the development side. The team could focus on learning Elixir and its ecosystem without the need to master additional technologies, resulting in increased productivity.

LiveView: simplifying the interaction between front-end and back-end developers

After the rewrite, LiveView, a Phoenix library for building interactive, real-time web apps, was used for every part of the front-end except for the “Onstage” subsystem (responsible for the live stream itself).

The two front-end developers, who came from a React background, also started writing LiveView. After this new experience, the team found API negotiation between front-end and back-end engineers much simpler than with React: they only had to use Elixir modules and functions instead of creating additional HTTP API endpoints, with all the extra work that comes with them, such as API versioning.
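
As a small, hypothetical illustration of that workflow (names like TicketLive and Tickets.reserve/2 are made up), a LiveView event handler calls straight into backend code, with no JSON endpoint or API version in between:

    defmodule TicketLive do
      use Phoenix.LiveView

      def mount(%{"show_id" => show_id}, _session, socket) do
        {:ok, assign(socket, show_id: show_id, reserved?: false)}
      end

      def handle_event("reserve", %{"qty" => qty}, socket) do
        # No HTTP API endpoint: just a function call into a context module.
        {:ok, _reservation} = Tickets.reserve(socket.assigns.show_id, String.to_integer(qty))
        {:noreply, assign(socket, reserved?: true)}
      end

      def render(assigns) do
        ~H"""
        <button phx-click="reserve" phx-value-qty="1">Reserve ticket</button>
        <p>Reserved? <%= @reserved? %></p>
        """
      end
    end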

Our front-end team, originally proficient in React, has made a remarkable transition to LiveView. They’ve wholeheartedly embraced its user-friendly nature and its smooth integration into our system.

- Vincent Franco, CTO

Conclusion: insights from Veeps’ Elixir experience

The decision to use Elixir has paid dividends beyond system scalability. The team, with varied backgrounds in Java, PHP, Ruby, Python, and JavaScript, found Elixir’s ecosystem to be a harmonious balance of simplicity and power.

By embracing Elixir’s ecosystem, including Erlang/OTP, Phoenix, LiveView, and Broadway, they built a robust system, eliminated the need for numerous external dependencies, and kept productively developing new features.

Throughout my career, I’ve never encountered a developer experience as exceptional as this. Whether it’s about quantity or complexity, tasks seem to flow effortlessly. The team’s morale is soaring, everyone is satisfied, and there’s an unmistakable atmosphere of positivity. We’re all unequivocally enthusiastic about this language.

- Vincent Franco, CTO

Veeps’ case illustrates how Elixir effectively handles high-scale challenges while keeping the development process straightforward and developer-friendly.


A Distributed Systems Reading List

2024/02/07


This document contains various resources and quick definitions of a lot of the background information behind distributed systems. It is not complete, even though it is kinda sorta detailed. I wrote it some time in 2019, when coworkers at the time asked for a list of references, and I put together what I thought was a decent overview of the basics of distributed systems literature and concepts.

Since I was asked for resources again recently, I decided to pop this text into my blog. I have verified the links again and replaced those that broke with archive links or other ones, but have not sought alternative sources when the old links worked, nor taken the time to add any extra content for new material that may have been published since then.

It is meant to be used as a quick reference to understand various distsys discussions, and to discover the overall space and the possibilities around it.

Foundational theory

This is information providing the basics of all the distsys theory. Most of the papers or resources you read will make references to some of these concepts, so explaining them makes sense.

Models

In a Nutshell

There are three model types used by computer scientists doing distributed system theory:

  1. synchronous models
  2. semi-synchronous models
  3. asynchronous models

A synchronous model means that each message sent within the system has a known upper bound on communication (the max delay between a message being sent and received) and on the processing speed between nodes or agents. This means you can know for sure that after a given period of time, a message was missed. This model applies in rare cases, such as hardware signals, and is mostly beginner mode for distributed system proofs.

An asynchronous model means that you have no upper bound. It is legit for agents and nodes to process and delay things indefinitely. You can never assume that a "lost" message you haven't seen for the last 15 years won't just happen to be delivered tomorrow. The other node can also be stuck in a GC loop that lasts 500 centuries, that's good.

Proving that something works in an asynchronous model means it works with all the other types. This is expert mode for proofs, and in most cases even trickier to get working than real-world implementations.

Semi-synchronous models are the cheat mode for the real world. There are upper bounds on the communication mechanisms and nodes everywhere, but they are often configurable and unspecified. This is what lets a protocol designer go "you know what, we're gonna stick a ping message in there, and if you miss too many of them we consider you're dead."

You can't assume all messages are delivered reliably, but you give yourself a chance to say "now that's enough, I won't wait here forever."

Protocols like Raft, Paxos, and ZAB (quorum protocols behind etcd, Chubby, and ZooKeeper respectively) all fit this category.
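
As a tiny illustration (not tied to any of the protocols above), the semi-synchronous assumption is what a plain receive-with-timeout encodes: pick a bound T and declare anything slower dead. The 5-second value below is an arbitrary, configurable choice, which is exactly the point.

    defmodule Ping do
      @timeout 5_000

      def alive?(node_pid) do
        ref = make_ref()
        send(node_pid, {:ping, self(), ref})

        receive do
          {:pong, ^ref} -> true
        after
          # No bound exists in a truly asynchronous model; we impose one anyway.
          @timeout -> false
        end
      end
    end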

Theoretical Failure Modes

The way failures happen and are detected is important to a bunch of algorithms. The following are the most commonly used ones:

  1. Fail-stop failures
  2. Crash failures
  3. Omission failures
  4. Performance failures
  5. Byzantine failures

First, Fail-stop failures mean that if a node encounters a problem, everyone can know about it and detect it, and the node can restore its state from stable storage. This is easy mode for theory and protocols, but super hard to achieve in practice (and in some cases impossible).

Crash failures mean that if a node or agent has a problem, it crashes and then never comes back. You are either correct or late forever. This is actually easier to design around than fail-stop in theory (but a huge pain to operate because redundancy is the name of the game, forever).

Omission failures imply that a node either gives correct results that respect the protocol or never answers.

Performance failures assume that while a node respects the protocol in terms of the content of the messages it sends, it may also deliver results late.

Byzantine failures mean that anything can go wrong (including people willingly trying to break your protocol with bad software pretending to be good software). There's a special class of authentication-detectable Byzantine failures which at least adds the constraint that you can't forge messages from other nodes, but that is an optional thing. Byzantine modes are the worst.

By default, most distributed system theory assumes that there are no bad actors or agents that are corrupted and willingly trying to break stuff, and byzantine failures are left up to blockchains and some forms of package management.

Most modern papers and stuff will try and stick with either crash or fail-stop failures since they tend to be practical.

See this typical distsys intro slide deck for more details.

Consensus

This is one of the core problems in distributed systems: how can all the nodes or agents in a system agree on one value? The reason it's so important is that if you can agree on just one value, you can then do a lot of things.

The most common example of picking a single very useful value is the name of an elected leader that enforces decisions, just so you can stop having to build more consensuses because holy crap consensuses are painful.

Variations exist on what exactly counts as consensus, including whether everyone must agree fully (strong) or just a majority (t-resilient), and the same question gets asked under the various synchronicity and failure models.

Note that while classic protocols like Paxos use a leader to ensure consistency and speed up execution, a bunch of systems will forgo these requirements.

FLP Result

In A Nutshell

Stands for Fischer-Lynch-Paterson, the authors of a 1985 paper stating that proper consensus, where all participants agree on a value, is unsolvable in a purely asynchronous model (even though it is solvable in a synchronous model) as long as any kind of failure is possible, even if it's just delays.

It's one of the most influential papers in the field because it triggered a lot of follow-up work by other academics trying to define exactly what is going on in distributed systems.

Detailed reading

Fault Detection

Following FLP results, which showed that failure detection was kind of super-critical to making things work, a lot of computer science folks started working on what exactly it means to detect failures.

This stuff is hard and often much less impressive than we'd hope for it to be. There are strong and weak fault detectors. The former implies all faulty processes are eventually identified by all non-faulty ones, and the latter that only some non-faulty processes find out about faulty ones.

Then there are degrees of accuracy:

  1. Nobody who has not crashed is suspected of being crashed
  2. It's possible that a non-faulty process is never suspected at all
  3. You can be confused because there's chaos but at some point non-faulty processes stop being suspected of being bad
  4. At some point there's at least one non-faulty process that is not suspected

You can possibly realize that a strong and fully accurate detector (said to be perfect) kind of implies that you get a consensus, and since consensus is not really doable in a fully asynchronous system model with failures, there are hard limits to what you can detect reliably.

This is often why semi-synchronous system models make sense: if you treat delays greater than T to be a failure, then you can start doing adequate failure detection.

See this slide deck for a decent intro.

CAP Theorem

The CAP theorem was for a long while just a conjecture, but it was proven in the early 2000s, leading to a lot of eventually consistent databases.

In A Nutshell

There are three properties to a distributed system:

  • Consistency: any time you write to a system and read back from it, you get the value you wrote or a fresher one back.
  • Availability: every request results in a response (including both reads and writes)
  • Partition tolerance: the network can lose messages

In theory, you can get a system that is both available and consistent, but only under synchronous models on a perfect network. Those don't really exist so in practice P is always there.

What the CAP theorem states is essentially that given P, you have to choose either A (keep accepting writes and potentially corrupt data) or C (stop accepting writes to save the data, and go down).

Refinements

CAP is a bit strict in what you get in practice. Not all partitions in a network are equivalent, and not all consistency levels are the same.

Two of the most common approaches to add some flexibility to the CAP theorem are the Yield/Harvest models and PACELC.

Yield/Harvest essentially says that you can think of the system differently: yield is your ability to fulfill requests (as in uptime), and harvest is the fraction of all the potential data you can actually return. Search engines are a common example here: they increase their yield (answering more often) by reducing their harvest, ignoring some search results so they can respond faster, or at all.

PACELC adds the idea that eventually-consistent databases are overly strict. In case of network Partitioning you have to choose between Availability and Consistency, but Else (when the system is running normally) one has to choose between Latency and Consistency. The idea is that you can decide to degrade your consistency for availability (but only when you really need to), or you could decide to always forego consistency because you gotta go fast.

It is important to note that you cannot beat the CAP theorem (as long as you respect the models under which it was proven), and anyone claiming to do so is often a snake oil salesman.

Resources

There have been countless rehashes of the CAP theorem and various discussions over the years; the results are mathematically proven, even if many keep trying to argue that their systems are so reliable it doesn't matter.

Message Passing Definitions

Messages can be sent zero or more times, in various orderings. Some terms are introduced to define what they are:

  • unicast means that the message is sent to one entity only
  • anycast means that the message is sent to any valid entity
  • broadcast means that a message is sent to all valid entities
  • atomic broadcast or total order broadcast means that all the non-faulty actors in a system receive the same messages in the same order, whichever that order is
  • gossip stands for the family of protocols where messages are forwarded between peers with the hope that eventually everyone gets all the messages
  • at least once delivery means that each message will be sent once or more; listeners can expect to see all messages, possibly more than once
  • at most once delivery means that each sender will only send the message one time. It's possible that listeners never see it.
  • exactly once delivery means that each message is guaranteed to be sent and seen only once. This is a nice theoretical objective but quite impossible in real systems. It ends up being simulated through other means (combining atomic broadcast with specific flags and ordering guarantees, for example)

Regarding ordering:

  • total order means that all messages have just one strict ordering and way to compare them, much like 3 is always greater than 2.
  • partial order means that some messages can compare with some messages, but not necessarily all of them. For example, I could decide that all the updates to the key k1 can be in a total order regarding each other, but independent from updates to the key k2. There is therefore a partial order between all updates across all keys, since k1 updates bear no information relative to the k2 updates.
  • causal order means that all messages that depend on other messages are received after these (you can't learn of a user's avatar before you learn about that user). It is a specific type of partial order.

There isn't a "best" ordering, each provides different possibilities and comes with different costs, optimizations, and related failure modes.

Idempotence

Idempotence is important enough to warrant its own entry. Idempotence means that when messages are seen more than once, resent or replayed, they don't impact the system differently than if they were sent just once.

Common strategies include having each message refer to previously seen messages so that an ordering prevents replaying older ones, or setting unique IDs (such as transaction IDs) coupled with a store that prevents replaying transactions, and so on.
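
A small sketch of the unique-ID strategy, with an in-memory MapSet standing in for whatever durable store of seen IDs a real system would use:

    defmodule IdempotentHandler do
      # state starts as %{seen: MapSet.new(), balance: 0}
      def handle(%{id: id} = message, %{seen: seen} = state) do
        if MapSet.member?(seen, id) do
          # Replay or duplicate delivery: ignore, the effect already happened.
          state
        else
          state
          |> apply_effect(message)
          |> Map.update!(:seen, &MapSet.put(&1, id))
        end
      end

      defp apply_effect(state, %{amount: amount}),
        do: Map.update!(state, :balance, &(&1 + amount))
    end

    # Applying the same message twice changes nothing the second time:
    # IdempotentHandler.handle(msg, state) ==
    #   IdempotentHandler.handle(msg, IdempotentHandler.handle(msg, state))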

See Idempotence is not a medical condition for a great read on it, with various related strategies.

State Machine Replication

This is a theoretical model by which, given the same starting state and the same operations applied in the same order (disregarding all kinds of non-determinism), all state machines will end up with the same result.

This model ends up being critical to most reliable systems out there, which tend to all try to replay all events to all subsystems in the same order, ensuring predictable data sets in all places.

This is generally done by picking a leader; all writes are done through the leader, and all the followers get a consistent replicated state of the system, allowing them to eventually become leaders or to fan-out their state to other actors.
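
A toy sketch of the idea: a deterministic apply function plus the same log in the same order always produces the same state, which is why replaying the leader's log is enough to build identical replicas.

    defmodule Replica do
      def apply_log(initial_state, log) do
        Enum.reduce(log, initial_state, &apply_op/2)
      end

      defp apply_op({:put, key, value}, state), do: Map.put(state, key, value)
      defp apply_op({:delete, key}, state), do: Map.delete(state, key)
    end

    # Every replica fed the same log ends up identical:
    # log = [{:put, :a, 1}, {:put, :b, 2}, {:delete, :a}]
    # Replica.apply_log(%{}, log) == Replica.apply_log(%{}, log)  # => %{b: 2}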

State-Based Replication

State-based replication can be conceptually simpler than state-machine replication: if you only replicate the state, you get the state at the end!

The problem is that it is extremely hard to make this fast and efficient. If your state is terabytes large, you don't want to re-send it on every operation. Common approaches include splitting, hashing, and bucketing data to detect changes and only send the changed bits (think of rsync), Merkle trees to detect changes, or the idea of a patch to source code.

Practical Matters

Here are a bunch of resources worth digging into for various system design elements.

End-to-End Argument in System Design

Foundational practical aspect of system design for distributed systems:

  • a message that is sent is not a message that is necessarily received by the other party
  • a message that is received by the other party is not necessarily a message that is actually read by the other party
  • a message that is read by the other party is not necessarily a message that has been acted on by the other party

The conclusion is that if you want anything to be reliable, you need an end-to-end acknowledgement, usually written by the application layer.

These ideas are behind the design of TCP as a protocol, but the authors also note that it wouldn't be sufficient to leave it at the protocol, the application layer must be involved.

Fallacies of Distributed Computing

The fallacies are:

  • The network is reliable
  • Latency is zero
  • Bandwidth is infinite
  • The network is secure
  • Topology doesn't change
  • There is one administrator
  • Transport cost is zero
  • The network is homogeneous

Partial explanations on the Wiki page or full ones in the paper.

Common Practical Failure Modes

In practice, when you switch from Computer Science to Engineering, the types of faults you will find are a bit more varied, but can map to any of the theoretical models.

This section is an informal list of common sources of issues in a system. See also the CAP theorem checklist for other common cases.

Netsplit

Some nodes can talk to each other, but some nodes are unreachable to others. A common example is that a US-based network can communicate fine internally, and so can an EU-based network, but both are unable to speak to each other.

Asymmetric Netsplit

Communication between groups of nodes is not symmetric. For example, imagine that the US network can send messages to the EU network, but the EU network cannot respond back.

This is a rarer mode when using TCP (although it has happened before), and a potentially common one when using UDP.

Split Brain

The way a lot of systems deal with failures is to keep a majority going. A split brain is what happens when both sides of a netsplit think they are the leader and start making conflicting decisions.

Timeouts

Timeouts are particularly tricky because they are non-deterministic. They can only be observed from one end, and you never know if a timeout that is ultimately interpreted as a failure was actually a failure, or just a delay due to networking, hardware, or GC pauses.

There are times where retransmissions are not safe if the message has already been seen (i.e. it is not idempotent), and timeouts essentially make it impossible to know if retransmission is safe to try: was the message acted on, dropped, or is it still in transit or in a buffer somewhere?

Missing Messages due to Ordering

Generally, using TCP and crashes will tend to mean that few messages get missed across systems, but frequent cases can include:

  • The node has gone down (or the software crashed) for a few seconds during which it missed a message that won't be repeated
  • The updates are received transitively across various nodes. For example, a message published by service A on a bus (whether Kafka or RMQ) can end up read, transformed or acted on and re-published by service B, and there is a possibility that service C will read B's update before A's, causing issues in causality

Clock Drift

Not all clocks on all systems are synchronized properly (even using NTP) and will go at different speeds.

Using a timestamp to sort through events is almost guaranteed to be a source of bugs, even more so if the timestamps come from multiple computers.

The Client is Part of the System

A very common pitfall is to forget that the client that participates in a distributed system is part of it. Consistency on the server-side will not necessarily be worth much if the client can't make sense of the events or data it receives.

This is particularly insidious for database clients that run a non-idempotent transaction, time out, and have no way to know whether they can try it again.

Restoring from multiple backups

A single backup is kind of easy to handle. Multiple backups run into a problem called consistent cuts (high level view) and distributed snapshots, which means that not all the backups are taken at the same time, and this introduces inconsistencies that can be construed as corrupting data.

The good news is there's no great solution and everyone suffers the same.

Consistency Models

There are dozens of different levels of consistency, all of which are documented on Wikipedia, in Peter Bailis' paper on the topic, or overviewed in Kyle Kingsbury's post on them.

  • Linearizability means each operation appears atomic and could not have been impacted by another one, as if they all ran just one at a time. The order is known and deterministic, and a read that started after a given write had started will be able to see that data.
  • Serializability means that while all operations appear to be atomic, it makes no guarantee about which order they would have happened in. It means that some operations might start after another one and complete before it, and as long as the isolation is well-maintained, that isn't a problem.
  • Sequential consistency means that even if operations might have taken place out-of-order, they will appear as if they all happened in order
  • Causal Consistency means that only operations that have a logical dependency on each other need to be ordered amongst each other
  • Read-committed consistency means that any operation that has been committed is available for further reads in the system
  • Repeatable reads means that within a transaction, reading the same value multiple times always yields the same result
  • Read-your-writes consistency means that any write you have completed must be readable by the same client subsequently
  • Eventual Consistency is a kind of special family of consistency measures that say that the system can be inconsistent as long as it eventually becomes consistent again. Causal consistency is an example of eventual consistency.
  • Strong Eventual Consistency is like eventual consistency but demands that no conflicts can happen between concurrent updates. This is usually the land of CRDTs.

Note that while these definitions have clear semantics that academics tend to respect, they are not adopted uniformly or respected in various projects' or vendors' documentation in the industry.

Database Transaction Scopes

By default, most people assume database transactions are linearizable, and they tend not to be because that's way too slow as a default.

Each database might have different semantics, so the following links may cover the major ones.

Be aware that while the PostgreSQL documentation is likely the clearest and most easy to understand one to introduce the topic, various vendors can assign different meanings to the same standard transaction scopes.

Logical Clocks

These are data structures that allow you to create either total or partial orderings between messages or state transitions.

Most common ones are:

  • Lamport timestamps, which are just a counter. They allow the silent undetected crushing of conflicting data
  • Vector Clocks, which contain a counter per node, incremented on each message seen. They can detect conflicts in data and on operations (a small sketch follows this list).
  • Version Vectors are like vector clocks, but only bump the counters on state changes rather than on every event seen
  • Dotted Version Vectors are fancy version vectors that allow tracking conflicts that would be perceived by the client talking to a server.
  • Interval Tree Clocks attempt to fix the issues of other clock types by requiring less space to store node-specific information and by allowing a kind of built-in garbage collection. The paper is also one of the nicest ever.
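
A minimal vector clock sketch (the representation as a plain map keyed by node name is just one convenient choice):

    defmodule VClock do
      def new, do: %{}

      # Increment this node's counter on a local event or before sending.
      def tick(clock, node), do: Map.update(clock, node, 1, &(&1 + 1))

      # On receive, merge by keeping the max counter seen for every node.
      def merge(a, b), do: Map.merge(a, b, fn _node, ca, cb -> max(ca, cb) end)

      # a dominates b if every counter in b is <= the matching counter in a.
      def dominates?(a, b), do: Enum.all?(b, fn {node, cb} -> Map.get(a, node, 0) >= cb end)

      def compare(a, b) do
        cond do
          a == b -> :equal
          dominates?(a, b) -> :after
          dominates?(b, a) -> :before
          true -> :concurrent   # neither happened before the other: conflict
        end
      end
    end

    # a = VClock.new() |> VClock.tick(:n1)
    # b = VClock.new() |> VClock.tick(:n2)
    # VClock.compare(a, b)                   # => :concurrent
    # VClock.compare(VClock.merge(a, b), a)  # => :after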

CRDTs

CRDTs are essentially data structures that restrict the operations that can be done so that they can never conflict, no matter the order in which they are done or how concurrently this takes place.

Think of them as the specification for how someone would write a distributed Redis that was never wrong, with only the maths left behind.

This is still an active area of research and countless papers and variations are always coming out.
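
As a flavor of what "operations that can never conflict" means, here is a sketch of the simplest CRDT, a grow-only counter: each replica only increments its own slot, and merge takes per-node maximums, so merges commute, associate, and are idempotent.

    defmodule GCounter do
      def new, do: %{}

      def increment(counter, node, amount \\ 1),
        do: Map.update(counter, node, amount, &(&1 + amount))

      def value(counter), do: counter |> Map.values() |> Enum.sum()

      # Commutative, associative, and idempotent: the properties that make
      # replicas converge no matter how merges are ordered or repeated.
      def merge(a, b), do: Map.merge(a, b, fn _node, ca, cb -> max(ca, cb) end)
    end

    # a = GCounter.new() |> GCounter.increment(:n1) |> GCounter.increment(:n1)
    # b = GCounter.new() |> GCounter.increment(:n2)
    # GCounter.value(GCounter.merge(a, b))                            # => 3
    # GCounter.merge(a, b) == GCounter.merge(b, GCounter.merge(a, b)) # => true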

Other interesting material

The bible for putting all of these views together is Designing Data-Intensive Applications by Martin Kleppmann. Be advised, however, that everyone I know who absolutely loves this book had a good foundation in distributed systems from reading a bunch of papers, and greatly appreciated having it all put in one place. Most people I've seen read it in book clubs with the aim of getting better at distributed systems still found it challenging and confusing at times, and benefitted from having someone around to whom they could ask questions in order to bridge some gaps. It is still the clearest source I can imagine for everything in one place.


Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.