OTP 24.1 Release


Highlights

  • OTP: The Efficiency Guide has been extended with a chapter about maps.

  • Stdlib: Most output functions in the io module now print extra
    error information when provided with invalid arguments.
    The functions are: io:format, io:fwrite, io:put_chars,
    io:nl and io:write.

  • Erts: init:stop() no longer unloads loaded code before
    terminating the runtime system. On systems with slow
    CPUs (such as Raspberry PI Zero), that can
    significantly speed up termination of the runtime system.

For more details about new features and potential incompatibilities see
https://erlang.org/download/OTP-24.1.README

Pre-built versions for Windows can be fetched here:

For a quicker build, a tar file with all the platform-independent parts already built (.beam files and doc chunks) can be found here:

Online documentation can be browsed here:

Many thanks to all the contributors.

The Erlang/OTP source can also be found at GitHub on the official Erlang repository,
https://github.com/erlang/otp


Guide to Tracing in Erlang

I’ve spent a lot of time debugging Elixir and Erlang applications over the last 5 years and decided to share what I’ve learned here. I started writing this blog post in April of 2017 and wasn’t inspired to finish it until watching Jeffery Utter’s Debugging Live Systems on the BEAM talk about a month ago. If you’d prefer a more concise presentation of Erlang tracing I suggest you watch that talk instead of reading this.

The first thing to understand when debugging Erlang applications is that Erlang’s concurrency greatly influences how debugging is done. While Erlang does have a traditional debugger, its breakpoints pause process execution and may cause other processes in the system to crash if they are waiting on a message from the paused process. Tracing, on the other hand, can be done under any circumstances and provides just as much information as a typical debugger. Tracing is also significantly more flexible than breakpoint debugging: it doesn’t interfere with process execution and can gather data from multiple processes at once.

Traditional Debugging in Erlang

There are times when setting a few breakpoints is all that’s needed to diagnose a problem. Erlang comes with a debugger module that allows you to set breakpoints that pause execution and allow you to inspect the scope. The debugger includes a GUI that gives you full control over the debugging process.

In this blog post I am using my simple arithmetic module as an example. To debug a module you first need to compile it with the debug_info flag turned on, e.g. erlc +debug_info arithmetic.erl. After compiling, start the debugger by running debugger:start(). in an Erlang shell:

erl
Erlang/OTP 18 [erts-7.2] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V7.2  (abort with ^G)
1> debugger:start().

The debugger GUI should appear.

screenshot of debugger gui
Figure 1. Debugger GUI

In the GUI navigate to Module > Interpret… and choose the module you want to debug from the file dialog. Modules that have not been compiled with debug_info will be grayed out. Once the module is loaded you should see it listed in the “Module” menu. In the submenu for the module click on View. A window with the module source code should appear.

screenshot of debugger source view
Figure 2. Debugger Source View

Double click on any line you want to add a breakpoint to, then hop back into the Erlang shell you started the debugger from and run the code you want to debug. In my case I wanted to debug arithmetic:multiply/2 so I ran:

2> arithmetic:multiply(1,2).

The call will block and the debugger will list the call with a status of break in the main window. Double clicking on the listing will bring up the module source code with variable values and the debugger controls at the bottom. From here you can inspect the values and continue to the next breakpoint by clicking the “Continue” button.

screenshot of debugger breakpoint
Figure 3. Debugger Breakpoint

The debugger has all the other features you would expect from a typical debugger. To learn more about it check the debugging page in the Erlang documentation.
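
The graphical debugger is a front end to the int module, which can also be driven directly from the shell. Here is a minimal sketch of the programmatic route (the line number is arbitrary; see the int documentation for the full API):

% Interpret the module (it must have been compiled with debug_info) and
% set a breakpoint on a line of arithmetic.erl without going through the GUI.
1> int:i(arithmetic).
2> int:break(arithmetic, 5).
% List all current breakpoints, then remove the one we just set.
3> int:all_breaks().
4> int:delete_break(arithmetic, 5).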

There are a couple of limitations to debugging Erlang like this. The first is that the module you want to run the debugger on must be compiled with the debug_info flag (e.g. erlc +debug_info). The second is that the code being debugged will block when it reaches a breakpoint. In many languages this isn’t a problem because everything is sequential and the rest of the program will wait until the user tells the debugger to continue. In Erlang, however, this can be problematic if another process in the application is waiting on a message from the blocked process. Unaware that the other process is paused by the debugger, the waiting process will time out if it doesn’t get a message in time, and timeouts often result in one or both processes being killed or crashing. Tracing doesn’t block execution, so it is possible to use it to debug more complex scenarios involving multiple processes.
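
To make that failure mode concrete, here is a hedged sketch (BPid stands for the pid of a process that is currently sitting at a breakpoint; the message shape and the five second timeout are made up):

% Ask the paused process for a result, waiting at most five seconds.
% Because BPid is stopped at a breakpoint it never replies in time, so
% this process gives up and exits with a timeout instead.
Ref = make_ref(),
BPid ! {compute, self(), Ref},
receive
    {Ref, Result} -> Result
after 5000 ->
    exit(timeout)
end.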

Tracing Basics

When it comes to tracing in Erlang it’s easy to get overwhelmed by everything. There are several different ways to run traces using the tools that come with Erlang, and then there are numerous other open-source tracing libraries available. We are going to first focus on the most basic Erlang tracing functions - the functions defined in the erlang module. Then we will look at the dbg tracer and one third-party open-source tracing library.

erlang:trace/3 and erlang:trace_pattern/2-3 functions

erlang:trace/3 is the tracing function that all other tracing tools are built on. We are only going to scratch the surface of what is possible with this function. The function takes only 3 arguments but there are many different options that can be passed in as the third argument, and there are several different values that are valid for the first argument.

erlang:trace(PidPortSpec, OnOffBoolean, FlagList) -> integer()

The first argument is either a literal pid, port ID, or atom specifying the processes and ports to trace:

PidPortSpec =
    pid() |
    port() |
    all | processes | ports | existing | existing_processes |
    existing_ports | new | new_processes | new_ports

The documentation explains what all of these options do, but I’ll highlight a few here:

  • all - trace all processes and ports.

  • processes - only trace processes.

  • new - only trace new processes and ports.

  • existing - only trace existing processes and ports.

Only the processes and ports you select are traced: every trace must select the processes to be traced before any trace data is collected. If you want to trace a function call that could be made from any process, you’ll need to trace all processes to collect the trace data for it.

tracing selection illustration
Figure 4. Visual cheatsheet of the possible PidPortSpec values

The second argument is simply a boolean flag that either turns tracing on or off for the specified pids or ports. Set it to true to turn tracing on and false to turn it off. The third argument is a list of flags that control how the tracing is done:

all | send | 'receive' | procs | ports | call | arity |
    return_to | silent | running | exiting | running_procs |
    running_ports | garbage_collection | timestamp |
    cpu_timestamp | monotonic_timestamp |
    strict_monotonic_timestamp | set_on_spawn |
    set_on_first_spawn | set_on_link | set_on_first_link |
    {tracer, pid() | port()} |
    {tracer, module(), term()}

There are a lot of options here and most of these options can be combined. The documentation contains a full explanation of each of these options but I’ll highlight a few of the important ones here:

  • all - all trace flags except tracer and cpu_timestamp. This option is a good default.

  • send, 'receive' - trace sends and receives for messages.

  • garbage_collection - trace garbage collection events.

  • set_on_spawn and set_on_link - extend tracing to any process that is spawned or linked to from a process that is already traced.

  • {tracer, Pid} - specifies the process to send the trace messages to. This is useful when you want to have a process running to handle trace messages and want to do something with them besides sending them to the shell process.

  • {tracer, Module, State} - specifies a tracer function that should be invoked with the trace messages. See erl_tracer docs for how to write a tracer module.

All events you want to record must be selected using trace flags.

tracing events illustration
Figure 5. Visual cheatsheet of some of the possible FlagList values
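
As a small sketch of how these flags combine (TracedPid and TracerPid are hypothetical pids, and the particular flag mix is just an example): trace calls and message sends for one process, extend the trace to anything it spawns, timestamp every event, and deliver the trace messages to a dedicated tracer process.

% set_on_spawn makes any process spawned by TracedPid inherit the same
% trace flags; {tracer, TracerPid} routes the trace messages away from
% the shell to our own handler process.
erlang:trace(TracedPid, true,
             [call, send, set_on_spawn, timestamp, {tracer, TracerPid}]).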

erlang:trace/3 may be used on its own, but for tracing function calls erlang:trace_pattern/2,3 must also be invoked to tell Erlang which functions you want to trace. erlang:trace/3 is used to specify what processes and ports are to be traced, and erlang:trace_pattern/3 specifies what functions are to be traced. Tracing only occurs when a traced process invokes a traced function. The first argument to this function is a tuple containing the module, function, and arity of the functions that should be traced. Here are a few example arguments:

% All functions in `Module` named `Function` of any arity
{Module,Function,'_'}

% All functions in module `Module`
{Module,'_','_'}

% All functions in all modules
{'_', '_', '_'}

The second argument may be true or false, to turn tracing on or off for the trace pattern, or it may be a match spec; we’ll come back to match specs shortly. The third argument is a list of options for trace patterns. There are only two I want to list here:

  • global - turns on tracing only for function calls that explicitly state the module name (e.g. arithmetic:multiply(3, 4))

  • local - turns on tracing for all function calls, regardless of how the function is invoked (e.g. multiply(3, 4))

See the documentation if you want the complete list of options.
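
To make the global/local distinction concrete, consider this hypothetical module (the names are made up for illustration) in which an exported function calls an unexported helper:

-module(demo).
-export([outer/1]).

% With the global option, only fully qualified calls such as demo:outer(3)
% generate trace events; the call to helper/1 below is a local call and is
% not traced.
outer(X) ->
    helper(X) + 1.

% With the local option, calls to unexported functions like this one are
% traced as well.
helper(X) ->
    X * 2.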

Example Trace with erlang:trace/3 and erlang:trace_pattern/3

Now let’s set up tracing for one of the functions used by the arithmetic module I showed earlier:

% Turn on call tracing for all processes
1> erlang:trace(all, true, [call]).
40

% Trace all function calls in the arithmetic module
2> erlang:trace_pattern({arithmetic, '_', '_'}, true, [local]).
6

% Run the code in another process
3> spawn(fun() -> arithmetic:multiply(4, 3) end).
<0.126.0>

% Turn tracing off
> erlang:trace(all, false, [call]).
40

% Because we didn't specify a tracer process all trace messages got sent to the shell
> flush().
Shell got {trace,<0.126.0>,call,{arithmetic,multiply,[4,3]}}
Shell got {trace,<0.126.0>,call,{arithmetic,'-multiply/2-fun-0-',[1,0,4]}}
Shell got {trace,<0.126.0>,call,{arithmetic,'-multiply/2-fun-0-',[2,4,4]}}
Shell got {trace,<0.126.0>,call,{arithmetic,'-multiply/2-fun-0-',[3,8,4]}}

You can see the shell received a trace message for every function invoked in the arithmetic module. Because we passed in the local flag to the trace pattern even the anonymous function calls inside the lists:foldl/3 loop were reported. You can also see function calls to other modules were not reported. We don’t see the call to lists:foldl/3 or to lists:seq/2 because the trace pattern only specified the arithmetic module. When the flush() function is invoked in the shell it prints Shell got followed by the message received. Different trace events generate messages with slightly different formats, but all trace messages are tuples with the first item being the atom trace. Since we are only tracing function calls you can see all trace messages have the same format:

{trace, PidOfInvocation, call, {Module, FunctionCalled, ArgumentsInvokedWith}}

The Erlang documentation for the erlang:trace/3 function lists all trace message formats.
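
The arithmetic module itself is not shown in the post, but a minimal version consistent with the trace output above (an anonymous fun folded over lists:seq/2, with the multiplicand captured as a free variable) might look roughly like this. Treat it as a reconstruction, not the author’s exact code:

-module(arithmetic).
-export([multiply/2]).

% Multiply X by Y by adding X to an accumulator once for every element of
% lists:seq(1, Y). The anonymous fun shows up in the traces as
% '-multiply/2-fun-0-'/3 because the captured variable X becomes an extra
% argument of the compiler-generated local function.
multiply(X, Y) ->
    lists:foldl(fun(_N, Acc) -> Acc + X end, 0, lists:seq(1, Y)).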

Another Trace with erlang:trace/3 and erlang:trace_pattern/3

To demonstrate a few more features of erlang:trace/3 I’m going to use a slightly more complicated example. Instead of the arithmetic.erl module I’ve been using in this blog post I’m going to use a variation of the module that starts a server process that solves basic arithmetic problems. While simple, the module is still fairly lengthy so I won’t show it here, but you can view the module here.

Since we are now tracing an Erlang server, let’s record message send and message receive events as well as function calls in the module. Let’s also process trace messages in a custom tracer process so we can control what we do with the trace data. Here is the code for a primitive tracer process that prints human readable trace messages:

-module(my_tracer).

-export([start/0]).

start() ->
  spawn(fun loop/0).

loop() ->
  receive
      % Print a human readable log line for each trace message received
      {trace, Pid, send, Msg, To} ->
          % Handle a message send trace message
          io:format("Process ~p sent message ~p to process ~p~n", [Pid, Msg, To]);
      {trace, Pid, 'receive', Msg} ->
          % Handle a message receive trace message
          io:format("Process ~p received message ~p~n", [Pid, Msg]);
      {trace, Pid, return_from, {Module, Function, Arity}, ReturnValue} ->
          % Handle a function return trace message
          io:format("Function ~p:~p/~p returned ~p in process ~p~n", [Module, Function, Arity, ReturnValue, Pid]);
      {trace, Pid, call, {Module, Function, Arguments}} ->
          % Handle a call trace message
          io:format("Function ~p:~p invoked with arguments ~p in process ~p~n", [Module, Function, Arguments, Pid]);
      Msg ->
          io:format("Received unexpected message ~p~n", [Msg])
  end,
  % Loop so the tracer keeps handling trace messages as they arrive
  loop().

You can see the tracer is expecting messages for sends and receives, but there is also a return_from trace message that it handles. In the previous example we only traced the invocation of functions, so we were able to see the arguments they were invoked with, but we couldn’t see the return value of each function call. We want to trace function returns as well, but that is only possible by invoking erlang:trace_pattern/3 with a match spec. I’m not going to go into the details of match specs in this blog post, but they are basically patterns that can be used to match the function calls the tracer should trace. The most general match spec that instructs the tracer to capture function returns is [{'_', [], [{return_trace}]}]. We’ll use this along with the tracer above to capture return values.
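
Match specs can do more than enable return tracing: the head and guard parts can also restrict which calls generate trace events at all. A small sketch (restricting the trace to arity-2 calls whose first argument is greater than 10; the threshold is arbitrary):

% Head: bind the first argument to '$1' and ignore the second.
% Guard: only match when '$1' is greater than 10.
% Body: emit a return_from trace event for the matching calls.
MatchSpec = [{['$1', '_'], [{'>', '$1', 10}], [{return_trace}]}],
erlang:trace_pattern({arithmetic, multiply, 2}, MatchSpec, [local]).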

Now let’s use this tracer on a call to the arithmetic server:

% Start my tracer process
1> Pid = my_tracer:start().

% Start the arithmetic server (I don't want to trace this part)
2> ServerPid = arithmetic_server:start().

% Start tracing calls, and message sends and receives
3> erlang:trace(ServerPid, true, [call, send, 'receive', {tracer, Pid}]).

% Trace all function calls in the arithmetic server module
4> erlang:trace_pattern({arithmetic_server, '_', '_'}, [{'_', [], [{return_trace}]}], [local]).

% Exercise the arithmetic server
5> arithmetic_server:multiply(ServerPid, 4, 3).
Process <0.89.0> received message {<0.85.0>,{multiply,4,3}}
Function arithmetic_server:multiply invoked with arguments [4,3] in process <0.89.0>
12
Function arithmetic_server:'-multiply/2-fun-0-' invoked with arguments [1,0,4] in process <0.89.0>
Function arithmetic_server:'-multiply/2-fun-0-'/3 returned 4 in process <0.89.0>
Function arithmetic_server:'-multiply/2-fun-0-' invoked with arguments [2,4,4] in process <0.89.0>
Function arithmetic_server:'-multiply/2-fun-0-'/3 returned 8 in process <0.89.0>
Function arithmetic_server:'-multiply/2-fun-0-' invoked with arguments [3,8,4] in process <0.89.0>
Function arithmetic_server:'-multiply/2-fun-0-'/3 returned 12 in process <0.89.0>
Function arithmetic_server:multiply/2 returned 12 in process <0.89.0>
Process <0.89.0> sent message 12 to process <0.85.0>
Function arithmetic_server:loop invoked with arguments [] in process <0.89.0>

% Turn tracing off
6> erlang:trace(all, false, [call, send, 'receive']).

Instead of being sent to the process that invoked erlang:trace/3, the trace messages were sent to the tracer process I specified, which immediately printed them as they were received. You can see that return values are printed when a function returns, so we know exactly what the arguments and the return value were for each function call. You can also see that the anonymous function inside the lists:foldl/3 call is traced, allowing us to see the input arguments and the return value for each iteration of the loop.

erlang:trace/3 Disadvantages

erlang:trace/3 and erlang:trace_pattern/3 are powerful tracing functions that make it possible to trace function calls, garbage collections, message passing, and other things in the Erlang VM. There isn’t much that cannot be done with the API these functions expose. However, there are some pretty obvious downsides to the erlang:trace/3 API:

  • The interface is complicated with a lot of options. As you’ve seen in these examples, simply tracing function calls requires at least two function calls, with three arguments to each. It’s a whole lot of ceremony for such a simple trace.

  • The interface is unintuitive. In addition to being complicated, certain features are extremely unintuitive. There are two clear examples of this in the traces I showed: First is turning off tracing. You need to invoke erlang:trace/3 with some of the same values for the third argument as when the trace was started, otherwise tracing won’t be turned off. The second is that return traces can only be enabled via the match specs passed to erlang:trace_pattern/3 and not via a trace flag to erlang:trace/3.

  • Not production safe. It’s easy to crash an Erlang node by simply tracing too many things. It’s also possible to create deadlock by tracing a function that is used by the tracer process. I’ve made simple mistakes and crashed nodes in development using these functions.

dbg

dbg is the only other tracing module that is included with Erlang that I’m going to cover in this article. dbg is a text-based tracer that typically just outputs trace data to the shell. It is part of the runtime_tools application so if you want to use it on your applications you’ll need to make sure they include runtime_tools.

dbg is very flexible just like erlang:trace/3, but the interface is somewhat different. Rather than being just two functions that support a wide variety of arguments, dbg provides a fairly large set of functions that each take only a few simple arguments.

Example Trace with dbg

Let’s perform the first trace we did with erlang:trace/3, but this time with dbg.

% Start the tracer
1> dbg:tracer().
{ok,<0.87.0>}

% Turn on call tracing for all processes
2> dbg:p(all, [c]).
{ok,[{matched,nonode@nohost,42}]}

% Trace all function calls in the arithmetic module
3> dbg:tpl({arithmetic, '_', '_'}, []).
{ok,[{matched,nonode@nohost,6}]}

% Exercise the code
4> arithmetic:multiply(4, 3).
(<0.85.0>) call arithmetic:multiply(4,3)
(<0.85.0>) call arithmetic:'-multiply/2-fun-0-'(1,0,4)
(<0.85.0>) call arithmetic:'-multiply/2-fun-0-'(2,4,4)
(<0.85.0>) call arithmetic:'-multiply/2-fun-0-'(3,8,4)
12

% Turn tracing off
5> dbg:stop_clear().
ok

As you can see it’s similar to using erlang:trace/3; the only big difference is that we first invoke dbg:tracer/0 to start dbg’s tracer process.

The dbg:p(all, [c]) function call is equivalent to erlang:trace(all, true, [call]) and ultimately causes dbg to invoke that same function.

The dbg:tpl({arithmetic, '_', '_'}, []) function call is equivalent to erlang:trace_pattern({arithmetic, '_', '_'}, true, [local]), and tells the tracer to trace all global and local function calls to any function in the arithmetic module.

One thing to note here is that dbg prints traces as human readable log lines. Unlike with erlang:trace/3, trace events are not sent to your process as messages; by default they are formatted and printed as text. For debugging in the shell, this is often more convenient.
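
If you do want programmatic access to the trace data while still using dbg, the dbg documentation describes dbg:tracer/2, which takes a handler fun and an initial state in place of the default text formatter. A sketch:

% Each trace message is passed to the handler fun along with the
% accumulated state; here we simply number the messages as we print them.
{ok, _} = dbg:tracer(process,
                     {fun(TraceMsg, Count) ->
                              io:format("trace #~p: ~p~n", [Count, TraceMsg]),
                              Count + 1
                      end, 1}).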

Another Trace with dbg

Let’s try tracing messages, function calls, and function returns like we did with erlang:trace/3. Just like in the previous dbg example, the dbg function calls correspond to the erlang:trace/3 and erlang:trace_pattern/3 calls; the API is just a little simpler.

1> dbg:tracer().
{ok,<0.87.0>}

% Start the arithmetic server (I don't want to trace this part)
2> ServerPid = arithmetic_server:start().
<0.91.0>

% Start tracing calls, and message sends and receives
3> dbg:p(ServerPid, [m, c]).
{ok,[{matched,nonode@nohost,1}]}

% Trace all function calls and function returns in the arithmetic_server module
4> dbg:tpl({arithmetic_server, '_', '_'}, [{'_',[],[{return_trace}]}]).
{ok,[{matched,nonode@nohost,12}]}

% Exercise the arithmetic server
5> arithmetic_server:multiply(ServerPid, 4, 3).
(<0.90.0>) << {<0.85.0>,{multiply,4,3}}
(<0.90.0>) call arithmetic_server:multiply(4,3)
12
(<0.90.0>) call arithmetic_server:'-multiply/2-fun-0-'(1,0,4)
(<0.90.0>) returned from arithmetic_server:'-multiply/2-fun-0-'/3 -> 4
6> (<0.90.0>) call arithmetic_server:'-multiply/2-fun-0-'(2,4,4)
(<0.90.0>) returned from arithmetic_server:'-multiply/2-fun-0-'/3 -> 8
(<0.90.0>) call arithmetic_server:'-multiply/2-fun-0-'(3,8,4)
(<0.90.0>) returned from arithmetic_server:'-multiply/2-fun-0-'/3 -> 12
(<0.90.0>) returned from arithmetic_server:multiply/2 -> 12
(<0.90.0>) <0.85.0> ! 12
(<0.90.0>) call arithmetic_server:loop()

% Turn tracing off
5> dbg:stop_clear().

As you can see it is largely the same. We still have to pass in a match spec with {return_trace} in the match body to tell dbg we want to trace function returns, just like with erlang:trace_pattern/3.

dbg Disadvantages

dbg is a powerful tracing library that can be used to trace anything and provides an interface that is easier to use than erlang:trace/3 and erlang:trace_pattern/3 but it is still fairly esoteric. The short function names can be confusing and hard to remember. It shares two of the same drawbacks that erlang:trace/3 has:

  • The interface is unintuitive. The function names can be hard to remember and turning on function return tracing still requires that a match spec be passed in with a function body of {return_trace}.

  • Not production safe. Just like with erlang:trace/3, it is easy to crash an Erlang node by tracing too many things.

Other tracing modules included with Erlang

There are several other modules that ship with Erlang that offer various tracing related features that I’m not covering here; they may be the topic of future blog posts. Here is a list of them:

  • et - Event Tracer. Used for recording arbitrary events. Can generate a sequence diagram from the recorded events.

  • ttb - Trace Tool Builder. Helps build traces for distributed systems. Relies on dbg and provides some nice shortcuts for starting traces and running them on distributed systems.

  • dyntrace - Contributed module that provides an interface for dynamic tracing using external tools like dtrace.

  • seq_trace - Module for tracing information flowing between processes via message passing.

  • erl_tracer - Erlang tracer behavior that you can implement in your own modules.

Tracing Libraries

There are a lot of third-party open-source debugging and tracing libraries for Erlang. Here are a few of the most notable:

  • recon_trace - A module that handles tracing of function calls in a production safe manner for a single Erlang node.

  • redbug - A tool for interacting with Erlang trace functions. Has some safety features built in, such as only using a safe subset of features, terminating the trace if it feels overloaded, and automatically exiting after a certain number of traces or certain period of time.

  • erlyberly - A debugger for Erlang, Elixir, and LFE with a GUI that uses Erlang tracing BIFs.

Of these three, the only one I think worth covering in this blog post is the Recon library’s recon_trace module.

Recon

Recon is an open-source library developed by Fred Hebert that contains various modules for diagnosing production issues with Erlang systems. One of the modules it contains is recon_trace. The documentation sums it up nicely:

recon_trace is a module that handles tracing in a safe manner for single Erlang nodes, currently for function calls only. Functionality includes:

  • Nicer to use interface (arguably) than dbg or trace BIFs.

  • Protection against dumb decisions (matching all calls on a node being traced, for example)

  • Adding safe guards in terms of absolute trace count or rate-limiting

  • Nicer formatting than default traces

The interface for recon_trace is much easier to use than the interfaces of erlang:trace/3 and dbg. The trace output is easier to read than dbg’s default trace output. recon_trace is suitable for tracing nodes in production, with the limitation that it can only run a trace on a single node and only traces function calls. recon_trace also automatically stops tracing after a specified number of calls, and any trace you start must specify the maximum number of calls to trace. This ensures you don’t overload an Erlang node by tracing a function that is invoked often.

Because recon is a library that doesn’t ship with Erlang you’ll need to add it as a dependency to the applications you want to use it in. If you think you may want to use recon to debug issues in production, add it as a dependency before building the production release of your application; otherwise, when an issue does come up in production, the recon library won’t be available to help debug it.

If you are using rebar3, adding it is as easy as specifying it in your rebar.config file:

{deps, [recon]}.

And then run rebar3 get-deps to download it.

Tracing with Recon

Because recon_trace only supports tracing function calls, we can’t trace message sends and receives from the arithmetic_server like we were able to with erlang:trace/3 and dbg. But we can still trace function calls, and with recon_trace it’s much simpler. All we do to start tracing is invoke recon_trace:calls/2,3 with an MFA tuple (with either a function arity or a match spec as the third element), a maximum number of trace messages, and optionally a list of options:

% Turn on call and return tracing for the next 20 calls for all processes. For both global and local function calls.
1> recon_trace:calls({arithmetic, '_', fun(_) -> return_trace() end}, 20, [{scope, local}]).
6

% Exercise the code
2> arithmetic:multiply(4, 3).
20:32:49.814072 <0.146.0> arithmetic:multiply(4, 3)
12
20:32:49.825579 <0.146.0> arithmetic:'-multiply/2-fun-0-'(1, 0, 4)
20:32:49.825802 <0.146.0> arithmetic:'-multiply/2-fun-0-'/3 --> 4
20:32:49.826031 <0.146.0> arithmetic:'-multiply/2-fun-0-'(2, 4, 4)
20:32:49.826208 <0.146.0> arithmetic:'-multiply/2-fun-0-'/3 --> 8
20:32:49.826342 <0.146.0> arithmetic:'-multiply/2-fun-0-'(3, 8, 4)
20:32:49.826465 <0.146.0> arithmetic:'-multiply/2-fun-0-'/3 --> 12
20:32:49.826658 <0.146.0> arithmetic:multiply/2 --> 12

% Stop tracing, we don't necessarily have to do this
3> recon_trace:clear().
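
The second argument to recon_trace:calls/3 does not have to be a flat call count; according to the recon documentation it also accepts a {Max, Milliseconds} rate limit, which is often a better fit for long-running production traces. A sketch:

% Allow at most 10 trace messages per 1000 milliseconds for calls to any
% function in the arithmetic module, instead of a fixed total.
recon_trace:calls({arithmetic, '_', '_'}, {10, 1000}, [{scope, local}]).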

Conclusion

Erlang itself provides low level functions and numerous tools for debugging application code. We saw that traditional breakpoint debugging is in fact possible in Erlang, but is limited in what it can do due to Erlang’s concurrent nature. We also saw how erlang:trace/3 and erlang:trace_pattern/2,3 allow us to trace almost any event in the Erlang virtual machine, and how to select processes for tracing. We also looked at the dbg library, which provides a different interface for running traces. Finally we looked at recon and the recon_trace module which provides production safe tracing functionality and a more streamlined tracing interface.

All of the tracing examples showed just a few of the things that are possible with the Erlang VM’s low level tracing functions. The VM’s tracing features are incredibly powerful and provide everything we need to inspect a running system and figure out what is going wrong.

If you found this blog post useful you may want to check out the free book, Erlang in Anger, by Fred Hebert, the creator of recon. It covers a lot of production debugging techniques and is geared towards those who may not have experience with the Erlang VM. Also consider subscribing to my blog as I am planning on writing more articles on Erlang and Elixir debugging and tracing in the future.

The Meaning of Dbg Function Names

If you’ve ever used the dbg module in Erlang you’ve seen the esoteric names of the functions it exports. I recently watched Jeffery Utter’s excellent ElixirConf talk on Debugging Live Systems on the BEAM. In the talk, when he covers the dbg:tpl/3 function, he guesses that it might stand for “tuple”. This reminded me that I didn’t know what most of the function names in the dbg module stood for, even though I learned how to use dbg years ago. This got me searching for the meanings behind the dbg function names.

Searching

First I did some Google searches hoping to find some StackOverflow answers or perhaps some old threads on the Erlang questions mailing list. The searches I tried turned up no results, so I looked at the Erlang Git repository and found the earliest commit of the dbg.erl file. The commit message didn’t contain anything useful, and the comments in the module source code weren’t any more helpful than the ones present in the latest version of the code. I decided to try to figure out the meanings of the function names on my own.

After reading through the dbg API documentation it was clear to me that the function names were acronyms (except for the ones that aren’t, like get_tracer/0 and stop/0). For example, p clearly stood for process. Some of the acronyms were not so obvious. I wrote a list of what I thought each acronym meant and posted the list on the erlang-questions mailing list. I got some good feedback, and Lukas Larsson confirmed my assumptions.

Meanings

Here is a complete list of dbg function name acronyms.

Function Name    Meaning
c/3,4            call
cn/1             clear node
ctp/0,1,2,3      clear trace pattern
ctpe/1           clear trace pattern event
ctpg/0,1,2,3     clear trace pattern global
ctpl/0,1,2,3     clear trace pattern local
dtp/0,1          delete trace pattern
h/0,1            help
i/0              information
ln/0             list nodes
ltp/0            list trace patterns
n/1              node
p/1              process
rtp/1            read trace pattern
tp/2,3,4         trace pattern
tpe/2            trace pattern event
tpl/2,3,4        trace pattern local
wtp/1            write trace pattern
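
With these expansions in mind, a typical dbg session reads almost like plain English. Here is a small, hedged example (arithmetic:multiply/2 is just a stand-in for whatever function you want to trace):

% Start a tracer, trace (c)alls in all (p)rocesses, and set a (t)race
% (p)attern for (l)ocal calls. The atom x is a built-in match spec alias
% that adds return and exception tracing.
dbg:tracer(),
dbg:p(all, c),
dbg:tpl(arithmetic, multiply, x),
arithmetic:multiply(4, 3),
% (c)lear the (t)race (p)attern for (l)ocal calls, then stop tracing.
dbg:ctpl(arithmetic),
dbg:stop().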

Conclusion

I find this list of definitions makes it much easier for me to remember the function names. It’s easier for me to remember “clear trace patterns for local function calls” than it is to remember ctpl. After discussing this on the mailing list I opened a PR to add this information to the documentation. Hopefully this info will make the dbg API seem a little less esoteric to newcomers.

Permalink

Decentralized ETS Counters for Better Scalability

A shared Erlang Term Storage (ETS) table is often an excellent place to store data that is frequently updated and read by multiple Erlang processes. ETS provides key-value stores to Erlang processes. When the write_concurrency option is activated, ETS tables use fine-grained locking internally, so a scenario where multiple processes insert and remove different items in an ETS table should scale well with the number of utilized cores. Unfortunately, in practice, the scalability for such scenarios is not yet perfect. However, the ETS table option decentralized_counters (introduced in Erlang/OTP 22 for ordered_set tables and in Erlang/OTP 23 for the other table types) has made the scalability much better. A table with decentralized_counters activated uses decentralized counters instead of centralized counters to track the number of items in the table and the memory consumption. The trade-off is that such tables have slow operations for getting the table size and memory usage (ets:info(Table, size) and ets:info(Table, memory)), so whether it is beneficial to turn decentralized_counters on or off depends on your use case. This blog post will give you a better understanding of when one should activate the decentralized_counters option and how the decentralized counters work.
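
For reference, the option is simply passed to ets:new/2 when the table is created. A minimal sketch (the table name and the other options are arbitrary):

%% decentralized_counters only pays off together with write_concurrency.
Tab = ets:new(my_table, [set, public,
                         {write_concurrency, true},
                         {decentralized_counters, true}]),
true = ets:insert(Tab, {some_key, 42}),
%% These are the operations that become slow with decentralized counters:
Size = ets:info(Tab, size),
Memory = ets:info(Tab, memory).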

Scalability with Decentralized ETS Counters

The following figure shows the throughput (operations/second) achieved when processes are doing inserts (ets:insert/2) and deletes (ets:delete/2) in an ETS table of the set type on a machine with 64 hardware threads, both when the decentralized_counters option is activated and when it is deactivated. The table types bag and duplicate_bag have similar scalability behavior, as their implementation is based on the same hash table.

[Figure: insert/delete throughput for a set table, with and without decentralized_counters]

The following figure shows the results for the same benchmark but with a table of type ordered_set:

[Figure: insert/delete throughput for an ordered_set table, with and without decentralized_counters]

The interested reader can find more information about the benchmark at the benchmark website for decentralized_counters. The benchmark results above show that both set and ordered_set tables get a significant scalability boost when the decentralized_counters option is activated. The ordered_set type receives a more substantial scalability improvement than the set type. Tables of the set type have a fixed number of locks for the hash table buckets, while the ordered_set table type is implemented with a contention adapting search tree that dynamically changes the locking granularity based on how much contention is detected. This implementation difference explains the difference in scalability between set and ordered_set. Details about the ordered_set implementation can be found in an earlier blog post.

It is also worth noting that the Erlang VM that ran the benchmarks was compiled with the configure option “./configure --with-ets-write-concurrency-locks=256”. This option changes the number of locks for hash-based ETS tables from the current default of 64 to 256 (256 is currently the maximum value this option can be set to). Changing the implementation of the hash-based tables so that the number of locks can be set per table instance, or so that the lock granularity is adjusted automatically, seems like an excellent future improvement, but that is not what this blog post is about.

A centralized counter consists of a single memory word that is incremented and decremented with atomic instructions. The problem with a centralized counter is that modifications of the counter by multiple cores are serialized. This problem is amplified because frequent modifications of a single memory word by multiple cores cause a lot of expensive traffic in the cache coherence system. However, reading from a centralized counter is quite efficient as the reader only has to read a single memory word.

When designing the decentralized counters for ETS, we have tried to optimize for update performance and scalability as most applications need to get the size of an ETS table relatively rarely. However, since there may be applications out in the wild that frequently call ets:info(Table, size) and ets:info(Table, memory), we have chosen to make decentralized counters optional.

Another thing that might be worth keeping in mind is that the hash-based tables that use decentralized counters tend to use slightly more hash table buckets than the corresponding tables without decentralized counters. The reason for this is that, with decentralized counters activated, the resizing decision is based on an estimate of the number of items in the table rather than an exact count, and the resizing heuristics trigger an increase of the number of buckets more eagerly than a decrease.

Implementation

You will now learn how the decentralized counters in ETS work. The decentralized counter implementation exports an API that makes it easy to swap between a decentralized counter and a centralized one; ETS uses this API to support both kinds of counters. The data structure for the decentralized counter is illustrated in the following picture. When is_decentralized = false, the counter field holds the current count instead of a pointer to an array of cache-line-padded counters.

[Figure: the decentralized counter data structure, with an is_decentralized flag and either an inline count or a pointer to an array of cache-line-padded counters]

When is_decentralized = true, processes that update (increment or decrement) the counter follow the pointer to the array of counters and increment the counter at the slot in the array that the current scheduler maps to (one takes the scheduler identifier modulo the number of slots in the array to get the appropriate slot). Updates do not need to do anything else, so they are very efficient and can scale perfectly with the number of cores as long as there are at least as many slots as schedulers. The maximum number of slots in the array of counters can be configured with the +dcg option.
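
Expressed as Erlang-flavored pseudocode, the slot selection amounts to something like the sketch below. This is only conceptual: the real code is C inside the VM, and the function name here is made up.

%% Hypothetical helper: picks the counter slot for the calling scheduler.
slot_for_current_scheduler(NumSlots) ->
    SchedulerId = erlang:system_info(scheduler_id),
    (SchedulerId rem NumSlots) + 1.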

To implement the ets:info(Table, size) and ets:info(Table, memory) operations, one also needs to read the current counter value. Reading the current counter value can be implemented by taking the sum of the values in the counter array. However, if this summation is done concurrently with updates to the array of counters, we could get strange results. For example, we could end up in a situation where ets:info(Table, size) returns a negative number, which is not exactly what we want. On the other hand, we want to make counter updates as fast as possible so having locks to protect the counters in the counter array is not a good solution. We opted for a solution that lets readers swap out the entire counter array and wait (using the Erlang VM’s thread progress system) until no updates can occur in the swapped-out array before the sum is calculated. The following example illustrates this approach:

  • [Step 1]

    A thread is going to read the counter value.

  • [Step 2]

    The reader starts by creating a new counter array.

  • [Step 3]

    The pointer to the old counter array is changed to point to the new one, which has its snapshot_ongoing field set to true. This change can only be done when the snapshot_ongoing field is set to false in the old counter array.

  • [Step 4]

    Now, the reader has to wait until all other threads that might still update a counter in the old array have completed their updates. As mentioned, this can be done using the Erlang VM’s thread progress system. After that, the reader can safely calculate the sum of the counters in the old counter array (say, 1406 in this example). The calculated sum is also given to the process that requested the count so that it can continue execution.

  • [Step 5]

    The read operation is not done yet, even though we have successfully calculated a count. The calculated sum must be added to the new array so that the counts accumulated in the old array are not lost.

  • [Step 6]

    Finally, the snapshot_ongoing field in the new counter array is set to false so that other read operations can swap out the new counter array.

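Condensing the six steps into Erlang-flavored pseudocode gives roughly the following. None of these helper functions exist under these names; the real implementation is the C code in erl_flxctr.c.

%% Pseudocode sketch of the read path described in the steps above.
read_counter_value(Counter) ->
    New = new_counter_array(),                             % Step 2
    Old = swap_in_when_no_snapshot_ongoing(Counter, New),  % Step 3
    wait_for_thread_progress(),                            % Step 4: no thread can still update Old
    Sum = sum_of_slots(Old),                               % Step 4
    add_to_slot(New, 1, Sum),                              % Step 5: fold the old total into the new array
    clear_snapshot_ongoing(New),                           % Step 6
    Sum.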

Now you should have a basic understanding of how ETS’s decentralized counters work. You are also welcome to look at the source code in erl_flxctr.c and erl_flxctr.h if you are interested in the details of the implementation.

As you can imagine, reading the value of a decentralized counter with, for example, ets:info(Table, size) is extremely slow compared to reading a centralized counter. Fortunately, most of the time spent reading the value of a decentralized counter is spent waiting for the thread progress system to report that it is safe to read the swapped-out array, and the read operation does not block any scheduler or consume any CPU time while it waits. On the other hand, a decentralized counter can be updated in a very efficient and scalable way, so decentralized counters are most likely preferable if you seldom need to get the size of, or the memory consumed by, your shared ETS table.
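
If you want the same trade-off for counters of your own, rather than for ETS bookkeeping, the counters module that ships with Erlang/OTP offers a comparable design when created with the write_concurrency option: updates are cheap and scale across schedulers, while reads are more expensive and are not guaranteed to observe the very latest updates made from other schedulers. (Whether it shares code with the ETS counters is an implementation detail; the observable trade-off is similar.)

%% A counter array optimized for concurrent updates from many schedulers.
CRef = counters:new(1, [write_concurrency]),
ok = counters:add(CRef, 1, 5),    % add 5 to counter index 1
ok = counters:sub(CRef, 1, 2),    % subtract 2 from counter index 1
Value = counters:get(CRef, 1).    % read the (possibly slightly stale) value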

Concluding Remarks

This blog post has described the implementation of the decentralized counter option for ETS tables. ETS tables with decentralized counters scale much better with the number of cores than ETS tables with centralized counters. However, as decentralized counters make ets:info(Table, size) and ets:info(Table, memory) very slow, they should not be used if either of these two operations needs to be performed frequently.

Permalink

Bootstrapping a multiplayer server with Elixir at X-Plane

Welcome to our series of case studies about companies using Elixir in production. See all cases we have published so far.

X-Plane 11 is the world’s most comprehensive and powerful flight simulator for personal computers and mobile devices. X-Plane is not a game but an engineering tool created by Laminar Research that can be used to predict the flying qualities of fixed- and rotary-wing aircraft with incredible accuracy. The X-Plane franchise comes in both consumer and FAA-certifiable professional versions.

Recently, the X-Plane team took on the challenge of adding a multiplayer experience with the goal of hosting north of 10000 users in the same session. This article explores why they chose Elixir and how a team of one developer - without prior language experience - learned the language and deployed a well-received multiplayer experience in 6 months. The overall solution features a brand new open-source implementation of the RakNet communication protocol in Elixir and exceeds the original requirements when put under unexpected load.

[Image: X-Plane]

Requirements

The X-Plane team has offered peer-to-peer multiplayer in the simulator for a long time but never server-hosted multiplayer. This was a new journey for them and they had complete freedom to pick the technology stack. According to their blog post, their goals were:

  1. To build a rock-solid server with error isolation. For example, an exception during a client update should not bring the whole server down.

  2. To implement a single shared world that can scale to tens of thousands of concurrent pilots.

  3. To iterate quickly: because this was the first time the Laminar Research team offered a hosted multiplayer environment, they wanted to move quickly to ship this system. This would allow users to begin flying in groups immediately and serve as a platform to gauge interest in further feature development.

  4. To be fast and consistent. Multiplayer has a “soft real-time” constraint, and they need to service all clients consistently and on time. Quantitatively, this means the 99th percentile response times matter a lot more than the mean or median.

From those requirements, the need for stability and fast iteration ruled out low-level languages, even the ones in which they had plenty of in-house experience.

The need for speed and vertical scalability excluded many modern web languages, such as Ruby and Python, where the model for scaling up is generally throwing more servers at it. It was essential to avoid synchronizing state across multiple machines, which requires more development time and worsens the user experience due to the increased latency.

They eventually settled on three top contenders: Rust, Go, and Elixir. Elixir took the edge thanks to two exclusive features: fault tolerance and predictable latency. Both are built into the very core of the Erlang Virtual Machine - the robust platform that Elixir runs on. Tyler Young, X-Plane’s engineer leading this implementation, highlights: “We wanted a stack that could max server capacity. We would rather run a 64-core machine than dozens of 4-core VMs. Saša Jurić’s talk, the Soul of Erlang and Elixir, showed us that the concurrency model, process isolation, and partial restarts provided by the platform were the abstractions we were looking for.”

Modeling multiplayer with Elixir

Ready to give Elixir a try, Tyler picked up a couple books but soon realized the language’s Getting Started guide provided the background he needed. He explains: “while the introductory guide covers the language constructs, the advanced guide on the website has you build an actual project with TCP connections, with the basic architectural patterns we would use in production.”

However, instead of jumping headfirst into the multiplayer server, he decided to give Elixir a try on a smaller problem. He wrote a web proxy to the National Oceanic and Atmospheric Administration (NOAA) weather services and put it in production. This experience taught him the importance of leveraging all of the instrumentation and metrics provided by the Erlang VM. They chose AppSignal to help consume and digest this information.

Two weeks later, he started working on the server by implementing the UDP-centric RakNet protocol in Elixir. Unfortunately, the protocol has little documentation, so they had to refer to the reference implementation in C++ most of the time. Luckily, thanks to their roots in telecommunication and network services, Elixir and Erlang have built-in support for parsing binary packets, which made the task a joy. The team also mapped each UDP connection to its own lightweight thread of execution in Elixir, which we call a process. Elixir processes are cheap, isolated, concurrent, and fairly scheduled by the runtime. This design allowed the X-Plane team to fully leverage the properties of robustness and predictable latency that first attracted them to the platform. Their implementation is written on top of Erlang’s gen_udp and is open source.
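
To give a flavor of what that support looks like, binary pattern matching lets a packet be destructured directly in a function head. The sketch below uses Erlang syntax (Elixir’s binary syntax is very close), and the field layout is made up for illustration; it is not the actual RakNet framing:

%% Hypothetical layout: a 1-byte packet id, a 3-byte little-endian
%% sequence number, and the rest as payload.
parse_packet(<<Id:8, SequenceNumber:24/little, Payload/binary>>) ->
    {Id, SequenceNumber, Payload}.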

Five months after choosing Elixir, they began welcoming beta testers into the server. The community’s reaction was overwhelmingly positive, and the new multiplayer experience led to a strong uptick in the number of subscriptions as it went live a month later.

Deployment and keeping it simple

At the moment, X-Plane’s player base in North America is powered by a single server, running on 1 eight-core machine with 16GB of memory, although only 200MB or 300MB of memory is effectively used. Each connected player sends 10 updates a second.

For deployments, they use a blue-green strategy, alternating between two servers of the same capacity. Tyler explains: “We are aware the Erlang VM provides hot code swapping and distribution, but we are taking the simplest route whenever possible. It is much easier for us to alternate between two servers during deployments, as the servers are stable and we don’t deploy frequently. Similarly, when it comes to distribution, we prefer to scale vertically or set up different servers in different regions for players across the globe.”

Paul McCarty, who joined the project after launch, can attest to its simplicity: “even without prior Elixir experience, I was able to jump in and add new functionality to our HTTP APIs early on.” Those APIs are built on top of Plug to power their chat services, provide information about connected users, and more. He concludes: “When adding new features, the server development is never the bottleneck.”

Paul and Tyler finished our conversation with a curious anecdote: a couple months ago, they distributed an updated client version with debug code in it. This additional code caused each connected user to constantly ping the server every 100ms, even if not in multiplayer mode. This caused their traffic to increase 1000x! They only discovered this increase 2 weeks later when they saw the CPU usage in their Elixir server went from 5% to 21%. Once they found out the root cause and how the system handled it, they realized they didn’t have to rush a client update to remove the debug code and they chose to maintain their regular release cycle. At the end of the day, it was a perfect example of the confidence they gained and built around the language and platform.

Permalink

Building Erlang 23.3.4.3 on Devuan/Debian/Ubuntu

Erlang R24 is out! …but there are, as usual with a version X.0 release, a few rough edges (enough of them around the important-but-annoying WX and OpenGL updates that I’ll be writing another post about that shortly…).

In the meantime, R23.3.4.3 is excellent and quite reliable. The steps for building with kerl are nearly identical to those for R22, but it is worth re-posting them with the relevant version updates (or if you like the video version, Dr. Kumar made a few videos demonstrating his home Erlang + ZX build). Also, if you’re in a part of the world where erlang.org’s docs are occasionally really really sloooooooowwww, don’t forget the R23 docs mirror (or my Erlang page with linky links to everything).

sudo apt update
sudo apt upgrade
sudo apt install \
    gcc curl g++ dpkg-dev build-essential automake autoconf \
    libncurses5-dev libssl-dev flex xsltproc libwxgtk3.0-dev \
    wget vim git
mkdir vcs bin
cd vcs
git clone https://github.com/kerl/kerl.git
cd ..
ln -s ~/vcs/kerl/kerl bin/kerl
kerl update releases
kerl build 23.3.4.3 23.3.4.3
kerl install 23.3.4.3 ~/.erts/23.3.4.3
echo '. "$HOME"/.erts/23.3.4.3/activate' >> .bashrc
. ~/.erts/23.3.4.3/activate
wget -q https://zxq9.com/projects/zomp/get_zx && bash get_zx

As usual, the ~/vcs/ directory is just my convention for version-controlled code that my $HOME sync scripts know to ignore, and you might want to install Erlang somewhere global on your system like /opt/erts/ or whatever. The steps above work without root privileges, with the exception of the apt commands. Remember that on Devuan or Debian you need to run the sudo commands as root (unless you have configured sudo on your system), and the rest as your normal local user account.
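
Once the activate script has been sourced, a quick check from an Erlang shell confirms that the newly built release is the one on your path:

%% In an erl shell started after activation:
erlang:system_info(otp_release).   % returns "23" for this build
erlang:system_info(version).       % returns the ERTS version string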

Permalink

Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.