Scaling reliably during the world's biggest sports events

Erlang Solutions works with some of the world’s leading online gambling & betting operators. See what we do.

“A busy day for an online betting shop is like Black Friday at Amazon, only prices change every second,” Martin Davies, Head of Technology at bet365, told us a year ago. During big sporting events, such as football tournaments that run for weeks and draw a global audience, even Amazon's Black Friday pales in comparison.

Operators need to scale and innovate while serving massive numbers of punters on a daily and hourly basis. Here are our tips on how to do that:

The innovation-scale dilemma: the biggest challenge facing online bookmakers

Bookmakers must not only meet the demands of high-frequency requests and traffic bursts, but also keep tech-savvy, multi-channel customers engaged. This is where they come across their biggest challenge: the “innovation-scale dilemma.”

To meet user demand, the industry continuously creates and improves new types of games and bets. This is when tech departments hit the dilemma. Over a certain number of users, the ability to innovate quickly is lost: precious time is spent rewriting software so that it runs at scale, slowing down releases.

To break this vicious cycle, the most innovative industry players – such as bet365 and William Hill – have chosen Erlang (or Elixir, a language based on the same underpinnings as Erlang) as their core technology. Erlang Solutions also supports global payment companies, like Klarna, messaging companies like ooVoo, and IoT companies such as Intel – all of whom face similar technical challenges to online bookmakers.

Read more on SBC News

Innovation at scale

“Innovate or die” is a reality of life for most consumer facing industries these days. As the pace of progress in the computing industry shows no signs of slowing down, disruption by innovative startups is an existential threat to established players in any market.

Online businesses are especially ripe for disruption as the tools and technologies required to start an online business are pretty much available for free with no strings attached. The online gambling and betting industry is no different. Given the recent surge in popularity of online gambling, especially during the World Cup, established players have to innovate just to stay where they are in the marketplace.

Every step of the way, user experience comes first. There is a relentless demand for it to be easier and smoother, regardless of backend challenges. This has left the gambling industry in search of the most efficient and effective ways of directing users to their objective, whether it’s registering, playing a game, placing a bet, cashing out that bet or making a deposit. Systems must be ultra-responsive and able to engage with customers when, where, and how they want.

Those who get this ‘right’ relative to their competitors enjoy higher customer engagement and retention, and hence higher profits. And for each operator, the journey to support customer-centricity through the tech stack is slightly different.

Some established operators have legacy systems that are hard to scale or modernise whilst maintaining uptime and quality of customer experience. Those operators would be wise to take a step back, review their technology choices and evaluate which of the bewildering range of tech, old and new, can best be deployed in helping them achieve these customer-centric goals. It can help to spend some time imagining what an ideal sportsbook architecture would look like, without the burden of legacy, and then mapping out an evolutionary roadmap to get there.

Read more on SBC News

Concurrency, scalability, and reliability

At any one time, bet365's systems are serving live odds and results to many hundreds of thousands of users while managing multiple data streams on the backend. At peak times, like the Super Bowl or the Grand National, the number of users swells by an order of magnitude. This is an awesome feat of concurrent engineering.

When it comes to concurrency, Erlang and Elixir, both built on the BEAM virtual machine, excel. They are concurrency-oriented functional languages, built to handle vast numbers of users at the same time. Users of Erlang in verticals as diverse as telecoms, adtech, financial payments, massively multiplayer online role-playing games, and social media have all exploited its ability to provide impressive concurrency.
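
To make that concrete, here is a minimal, illustrative Erlang sketch (invented for this post, not from any operator's codebase) that spawns a hundred thousand lightweight processes on a single node, each one standing in for a connected user waiting for an odds update:

-module(sessions).
-export([spawn_sessions/0]).

%% Spawn 100,000 processes; each waits for odds updates and
%% gives up after a minute of silence.
spawn_sessions() ->
    [spawn(fun session/0) || _ <- lists:seq(1, 100000)].

session() ->
    receive
        {odds_update, _Odds} -> session()
    after 60000 ->
        timeout
    end.

Each BEAM process costs only a few hundred machine words up front, which is what makes this approach viable at such scale.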

Modern betting infrastructures demand massive scalability, speed, and fault tolerance to run smoothly. Without scalability built into the system, operators can be left relying on additional hardware to scale vertically, which is neither a sustainable nor a cost-effective approach.

Erlang and Elixir’s abilities around concurrency go hand-in-hand with their massive scalability. If Erlang isn’t the best tool for every job your system has, a distributed architecture that can plug into Erlang-based products can make your betting system quick to deploy and scale elastically. Tools built in Erlang, such as Riak KV, scale out, up and down predictably. Say goodbye to the headache of emergency hardware and unpredictable system performance under severe load.

As on the stock market, downtime for an online betting operator has immediate financial and reputational consequences. For online sports betting, the provision of an ultra-reliable, real-time service is now a priority for bookmakers and punters alike. Just look at the booming in-play market for sports events.

In this world, pauses of any kind, whether from system failure, garbage collection, or queuing backlogs, are not acceptable. Online betting stacks must handle their constant torrents of data without impacting the system's processes or end users.

Erlang and Elixir's concurrency, no-shared-memory architecture, and built-in ‘fail and recover’ approach make them behave extremely gracefully and predictably under highly variable, stochastic load. In fact, Erlang can support data changing at four times the rate of other languages. This makes Erlang and Elixir ideal for building critical gambling and betting systems.
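
As a minimal sketch of that ‘fail and recover’ style (module and message names invented for illustration), a keeper process can link to a worker and restart it whenever it crashes:

-module(keeper).
-export([start/0]).

start() ->
    %% Trap exits so a crashing worker sends us an 'EXIT' message
    %% instead of taking us down with it.
    process_flag(trap_exit, true),
    loop(spawn_link(fun worker/0)).

loop(Pid) ->
    receive
        {'EXIT', Pid, _Reason} ->
            %% The worker died: recover by starting a fresh one.
            loop(spawn_link(fun worker/0))
    end.

worker() ->
    receive crash -> exit(boom) end.

In production systems this pattern comes packaged as OTP supervisors rather than being hand-rolled.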

Read more on SBC News

The best tool for the job

The software development community is going through a revolution of diversity and inclusion as more developers move away from the biggest programming languages in search of the right tool for the nuanced challenges of our new, technologically-literate world. The result: it is becoming rarer for organisations to be locked into a de facto programming language simply because that is what their developer community uses.

One-size-fits-all stacks will never scale indefinitely. The most important lesson we've learned working in the gaming industry is that a silver bullet for solving every problem simply does not exist, and it's important to pick the proper tool for the job. Erlang and Elixir are superb at scaling reliably and efficiently when betting operators need them to, but they also complement other tools and languages that are better suited to other aspects of a system.

One need look no further than bet365, whose Systems Development Manager for Middleware, Andrew Deane, recently went on the record to say: “When I joined bet365 4 years ago, the company was in its second wave of Erlang development.

“Having proved the model with Erlang, the door was open to explore other languages. Continuing the mantra of the right tool for the right job, we've found use cases for Google's Go in our notifications systems and testing tools and more recently for Elixir in our application tooling.”

Read more on SBC News



OTP-21: New sys_config_src option in relx

With the release of OTP-21 and rebar3 3.6.0, which includes the new relx 3.25.0, there is better support for dynamic configuration of releases at runtime. If you are new to this concept, check out the rebar3 docs first.

No longer will you be forced to declare RELX_REPLACE_OS_VARS and convert strings to integers in your code because the sys.config file had to be valid Erlang terms when building the release.

Instead, you can use {sys_config_src, "config/sys.config.src"} and {vm_args_src, "config/vm.args.src"} and the extended start up script will replace variables for .src files without the need to set RELX_REPLACE_OS_VARS.
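
For context, a minimal relx stanza in rebar.config using both options could look like this (the release and application names are placeholders):

{relx, [{release, {myrel, "0.1.0"}, [myapp]},
        {sys_config_src, "config/sys.config.src"},
        {vm_args_src, "config/vm.args.src"}]}.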

Since this is based on a change in systools to support including sys.config.src without requiring validation of the contents, you can also freely declare a config file that looks like this:

[{myapp, [{port, ${PORT}}]}].

If your release configuration uses the sys_config option, an error will still be returned because this isn't a valid Erlang term file; with sys_config_src, the contents do not have to be valid terms until after variable replacement. In this case, PORT=8080 bin/myrel console will use a sys.config with contents [{myapp, [{port, 8080}]}].

vm_args_src was added to relx for consistency. The only things that change are how the start script handles it internally and that RELX_REPLACE_OS_VARS is not required for replacement to be run.
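
For illustration, a vm.args.src can use the same ${VAR} syntax; the node name and cookie below are made-up examples:

-name myrel@${HOSTNAME}
-setcookie ${COOKIE}

Starting the release with HOSTNAME and COOKIE set in the environment then yields a concrete vm.args at boot.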

This change also resolves at least a couple of annoyances related to how relx had to handle sys.config before: https://github.com/erlang/rebar3/issues/1753 and https://github.com/erlware/relx/issues/604


My OTP 21 Highlights

OTP 21 is out! 🎉 In this post I’m going to list things that, I think, will matter the most for Elixir users.

Faster compiler

The compiler is about 10-20% faster. There are lots of factors that contribute to that result - the BEAM emulator itself is faster, the compiler received some performance improvements, and the file system was completely overhauled to use NIFs and dirty schedulers instead of port drivers.

The results? For example, compiling ecto with all dependencies from a clean directory (just after rm -rf _build) takes about 10 seconds on OTP 21, while it takes about 12 seconds on OTP 20.3. This might not seem like much, but it's a noticeable difference - especially on bigger projects and when running tests; after all, the test modules are compiled each time they are executed. Taking ecto as an example one more time, on OTP 20.3 the test suite completes in 6.5s and on OTP 21.0 in 5.6s.

New callback in GenServer: handle_continue/2

If you've written some GenServers, I'm pretty sure you have faced the “lazy initialization” problem: a situation where you want to return from the init/1 callback to unblock start_link/1 and allow the parent supervisor to start the remaining children, yet you still want to execute some potentially long-running code before accepting the first message. Using a 0 timeout or sending yourself a message were common solutions to this problem, but both are prone to race conditions where some other message arrives first (which is very easy for a named server that is restarting).

With the new handle_continue/2 callback the solution is very clean and simple:

defmodule Server do
  use GenServer

  def init(arg) do
    state = execute_sync_initialization(arg)
    {:ok, state, {:continue, :init}}
  end

  def handle_continue(:init, state) do
    state = execute_async_initialization(state)
    {:noreply, state}
  end
end

This does just what you wanted from the beginning - it executes the async initialization code before accepting any message, yet allows the parent supervisor to continue execution. No race conditions possible - “it just works” ™️!

Optimised creation of small maps

We use maps a lot in our Elixir programs. This means we also create a lot of maps.

The Erlang compiler and runtime optimise “literals” to reside in constant memory - creating such data is basically only a pointer assignment. Additionally this data is excluded from garbage collection and is not copied when sent to other processes in messages. Unfortunately, for maps, this only works when all the keys and values are known at compile-time - a relatively rare occurrence.

Early this year I contributed a compiler change that extended this optimisation for maps. It takes advantage of the fact that small maps (under 32 keys) are effectively implemented as two tuples sorted in key order - one for keys and one for values. This means that if we know all the keys at compile-time (which is very frequent, and true for all structs), we can allocate the “keys” tuple as a literal and gain all the optimisations literals benefit from in general. As a result, structs (and all maps in general) created with literal %Foo{....} constructors need only about half the memory, and those maps are much faster to create.

As a benchmark, let’s use the following program:

defmodule Test do
  fields = for i <- 1..15, do: :"field#{i}"
  defstruct fields

  value = quote(do: value)
  field_values = Enum.zip(fields, Stream.cycle([value]))

  def new(unquote(value)) do
    %__MODULE__{unquote_splicing(field_values)}
  end
end

Benchee.run(%{"new" => fn -> Test.new(1) end}, time: 30)

On OTP 20.3, on my machine¹, I get 2.8M iterations per second, while on OTP 21.0 I get 3.5M iterations per second. This obviously also incorporates some other optimisation work in this release, but I'm convinced the bulk of the speed-up comes from this one change. Additionally, when we use :erts_debug.size_shared/1 to measure memory (accounting for structural sharing resulting from the optimised key literals), we notice that a map constructed with Test.new/1 is much smaller:

:erts_debug.size_shared(Test.new(1))

On OTP 20.3 it returns 36 words, and on OTP 21 it returns just 19 words (on a 64-bit machine a word is 8 bytes).

This might also mean some other indirect performance gains - for example when comparing structs, the keys tuple will be pointer-compared skipping the entire comparison routine.

The following benchmark:

map1 = Test.new(1)
map2 = Test.new(1)
Benchee.run(%{"compare" => fn -> map1 == map2 end}, memory_time: 0.1, time: 30)

On OTP 20.3, this results in 4.3M iterations per second, while on OTP 21.0 I get 5.5M iterations - over a 25% speed-up in such a basic operation!

New guard functions for maps

Another contribution of mine, which I managed to squeeze in just before the code freeze for the OTP 21 release, introduces two new guard functions - map_get/2 and is_map_key/2. Their functionality is equivalent to :maps.get/2 and :maps.is_key/2 respectively, but because they are allowed in guards, they enable one of the long-requested features - an is_struct/2 guard function. Bear in mind, we're not yet exactly sure how these will be exposed in Elixir, but it should allow for code at least similar to this:

def process(pattern)
    when is_binary(pattern) or is_struct(pattern, Regex) do
  do_something_with_pattern(pattern)
end

This does not give new capabilities (since it was always possible to do such matching in function bodies), but it should make Elixir more expressive.

Those functions are also available in ETS match specs, which means it’s now possible to extract just one field from a map stored in an ETS table, for example:

iex> table = :ets.new(:test, [])
iex> :ets.insert(table, {:key, %{foo: 1, bar: 2}})
iex> ms = [{ {:key, :"$1"}, [], [{:map_get, :foo, :"$1"}]}]
iex> :ets.select(table, ms)
[1]

This should help in optimising some programs that store large maps in ETS tables and need just one field - they no longer need to copy the entire map into the process just to extract this field.

New file implementation and enhanced IO scalability

As I mentioned in the compiler section, the file system received a rewrite and now uses NIFs and dirty schedulers instead of port drivers. This allowed performance to increase dramatically in some situations - by as much as 3 times.

Additionally, the way the VM polls for new IO events (for example, a TCP socket signalling it has data to read) was overhauled as well. This should have a positive impact on large, concurrent servers. It would be interesting to see some benchmarks of Phoenix applications comparing performance on OTP 20 and 21 - I haven't seen any so far.

And many more…

That's just some of the changes that I've picked out - but there are many more that will improve our ecosystem. To learn more about this latest release, see the full release notes. I also recommend taking a look at the post on the OTP team's blog by Lukas Larsson covering the highlights from the perspective of the OTP team.

And there’s already some awesome stuff scheduled for OTP 22!


  1. My machine is:

    Operating System: macOS
    CPU Information: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
    Number of Available Cores: 8
    Available memory: 16 GB
    Elixir 1.7.0-dev
    Erlang 21.0 (or 20.3)
    


Erlang OTP 21.0 is released


OTP 21

Erlang/OTP 21 is a new major release with new features and improvements, as well as some incompatibilities.

Potential Incompatibilities

  • All Corba applications have now been moved out of the OTP repository
  • A new Corba repository will be created under https://github.com/erlang
  • New applications ftp and tftp, moved from inets
  • ssl no longer supports 3_DES cipher suites or RSA-key exchange cipher suites by default
  • erlang:monitor on a primitive node (erl_interface, jinterface, etc.) will no longer fail with a badarg exception. Instead a monitor will be created, but it will only supervise the connection to the node.

Highlights

Erts:

  • Enhanced IO scalability
  • Support for usage of distribution controller processes for alternative transports, routing etc
  • Compact instruction encoding on 64-bit systems for code below 4 GB, giving 20% less memory for loaded code
  • Rewrite of the efile driver with NIFs and "dirty schedulers", resulting in faster file operations
  • Non-SMP VM removed
  • Link and monitor optimized for scalability
  • os:getenv/putenv now work on a thread-safe emulation; they are no longer in sync with libc getenv(3), so manual synchronization will be needed.

Compiler:

  • Misc compiler optimizations including contributions from the Elixir team resulting in 10% improvements in benchmarks
  • "Tuple calls" have been removed from the run-time system.
  • Code such as f({ok, Val}) -> {ok, Val} is now automatically rewritten to f({ok, Val} = Tuple) -> Tuple. This reduces code size and execution time, and removes GC pressure.
  • More information in stacktrace from a number of operators
  • erlang:get_stacktrace/0 deprecated to be replaced with try ... catch C:R:Stacktrace -> ...
  • Creation of small maps with literal keys optimized.
  • A new predefined macro `OTP_RELEASE` and preprocessor directives `-if` and `-elif` (see the compatibility sketch after this list)
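
A small compatibility sketch (module and function names invented here) ties two of these highlights together: since the OTP_RELEASE macro is only predefined from OTP 21 onwards, even a plain -ifdef on it can select the new three-pattern catch clause on 21 and fall back to the deprecated erlang:get_stacktrace/0 on older releases:

-module(compat).
-export([reason/1]).

-ifdef(OTP_RELEASE).
%% OTP 21 or later: bind the stacktrace directly in the catch clause.
reason(F) ->
    try F() catch _Class:Reason:_Stacktrace -> Reason end.
-else.
%% Earlier releases: the stacktrace is only reachable via erlang:get_stacktrace/0.
reason(F) ->
    try F()
    catch
        _Class:Reason ->
            _Stacktrace = erlang:get_stacktrace(),
            Reason
    end.
-endif.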

Security:

  • DTLS is now supported in the SSL application
  • Enhanced support for distribution over TLS
  • "unsecure" ciphers removed from defaults in SSL and SSH.
  • A new option value defined to facilitate implementing exec servers. Old option kept for compatibility, but now gives errors on stderror.

Standard libraries:

  • New API for logging, logger
  • New uri_string module for parsing URIs according to "The standard" (RFC 3986)
  • New function lists:search(Pred, List) -> {value, Value} | false (see the example after this list)
  • Changed default behaviour of .erlang loading. escript, erlc, dialyzer and typer no longer load an .erlang at all.
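
As a quick illustration of the new lists:search/2 (shell session):

1> lists:search(fun(X) -> X rem 2 =:= 0 end, [1, 2, 3, 4]).
{value,2}
2> lists:search(fun(X) -> X > 4 end, [1, 2, 3, 4]).
false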

For more details see
http://erlang.org/download/otp_src_21.0.readme

Pre built versions for Windows can be fetched here:
http://erlang.org/download/otp_win32_21.0.exe
http://erlang.org/download/otp_win64_21.0.exe

Online documentation can be browsed here:
http://erlang.org/documentation/doc-10.0/doc

The Erlang/OTP source can also be found on GitHub in the official Erlang repository: OTP-21.0

 

Thank you for all your contributions!


Long-lived processes in Elixir

One of the things that I loved about Elixir when I first started using the language was the fact that everything runs inside a process (an Erlang VM process, not an operating system process). Each process is lightweight and isolated, and creating and destroying them is fast. As such, Elixir’s processes are front and center, which makes interacting with them both necessary and wonderful.

For example, if you start the interactive console, iex, you’ll see that all your commands are running in a process:

$ iex
Interactive Elixir

iex> self()
#PID<0.84.0>

iex> 1 + 1
2

iex> self()
#PID<0.84.0>

so all of the code you run in the console is running in process 84 (an Elixir/Erlang process).

Naturally, you can choose to run the code in a separate process,

iex> spawn(fn -> 1 + 1 end)
#PID<0.86.0>

spawn/1 creates a new process which runs the function provided, fn -> 1 + 1 end. Interestingly, we do not see the return value of the anonymous function because the function ran in a different process. What we get instead is the process id, pid, of the spawned process.

Notice moreover that I said that the anonymous function ran in another process — ran as in past tense. Once process 86 finished doing all of its work, it exited and we can no longer get any data from it.

iex> pid = spawn(fn -> 1 + 1 end)
#PID<0.86.0>

iex> Process.alive?(pid)
false

So if a process exits when it is finished with its work, how can we have a long-lived process in Elixir?

Well… we just need to give the process work that never ends.

A never-ending story

How can we give a process work that never ends?

As a developer you may have experienced (accidentally) creating an infinitely recursive loop in a program, causing the program to hang for a while and eventually crash. In our case we want to do this intentionally.

Let’s see it in action by writing a simple loop,

defmodule LiveLong do
  def run do
    run()
  end
end

That’s the simplest loop I can think of. Now in iex we can spawn a process that uses that function,

iex> pid = spawn(LiveLong, :run, [])
#PID<0.96.0>

iex> Process.alive?(pid)
true

iex> Process.info(pid)
[
  current_function: {LiveLong, :run, 0},
  status: :running,
  # more info
]

There we go! We have created a long-lived process. It will keep running in an infinite loop until we tell it to exit. But magically, it will not crash our program.

Why doesn’t it crash?

In other languages, if you get stuck in an infinitely recursive loop, your program hangs for a while and then crashes. Why doesn’t that happen here?

Elixir has tail-call optimization (or really, last-call optimization). That means that as long as the last thing a function does is call another function (in our case, itself), the BEAM will not allocate a new stack frame on the call stack, and our program will not suffer from a stack overflow.

Okay, but why do I care?

It is true that you may not need a long-running process if all you need is to run a script or perform a one-off task. But having long-running processes is key to programming in Elixir. For example, it is how GenServers can send and receive messages while keeping state, and it is how Supervisors can provide fault tolerance to your system by monitoring and restarting processes that crash.

In order to make this sink in even more, let’s create something that resembles a simple GenServer - a process that is long-lived, can handle messages, and can store state.

defmodule SimpleGenServer do
  def start do
    initial_state = []
    receive_messages(initial_state)
  end

  def receive_messages(state) do
    receive do
      msg ->
        {:ok, new_state} = handle_message(msg, state)
        receive_messages(new_state)
    end
  end

  def handle_message({:store, value}, state) do
    {:ok, [value | state]}
  end

  def handle_message({:send_all_values, pid}, state) do
    send(pid, {:all_values, state})
    {:ok, state}
  end
end

Let’s do a quick breakdown of each function,

  • start/0 simply defines an empty list as the initial state and calls receive_messages/1.

  • receive_messages/1 is where the magic happens. Here we use the receive/1 Elixir primitive to pattern match messages out of our mailbox. In our case, we grab any message that comes in (msg) and pass it to our handle_message/2 function along with the state. The handle_message/2 function will do something with the message and state, and we expect it to return an updated state. We then perform the magic trick of making a recursive tail call. We call receive_messages/1 with the new state.

  • handle_message({:store, value}, state) will simply pattern match when we want to store a value, and it will add it to the current state.

  • handle_message({:send_all_values, pid}, state) receives a pid to which it will send all the values found in the process (the current state).

Now let’s interact with it,

iex> pid = spawn(SimpleGenServer, :start, [])
#PID<0.146.0>

iex> send pid, {:store, 23}
iex> send pid, {:store, "hello"}

iex> iex_pid = self()
iex> send pid, {:send_all_values, iex_pid}

iex> flush()
{:all_values, ["hello", 23]}

where flush/0 is a convenience function in iex that allows us to get all the messages in the console’s mailbox instead of having to pattern match them out with receive/1.

As you can see, our long-lived process successfully stored the data we sent to it, and we were able to retrieve all the values via message passing.

Long live long-lived processes!


A Brief History of the BEAM Compiler

This blog post is a brief history lesson about the Erlang compiler for the BEAM machine. To provide some context, there will first be a quick look at the abstract machines for Erlang.

A brief overview of the early Erlang implementations

The Prolog interpreter

The first version of Erlang was implemented in Prolog in 1986. That version of Erlang was used to find out which features of the language were useful and which were not. New language features could be added or deleted in a matter of hours or days.

JAM (Joe’s Abstract Machine)

It soon became clear that Erlang needed to be at least 40 times faster to be useful in real projects.

In 1989 JAM (Joe’s Abstract Machine) was first implemented. Mike Williams wrote the runtime system in C, Joe Armstrong wrote the compiler, and Robert Virding wrote the libraries.

JAM turned out to be 70 times faster than the Prolog interpreter. Success?

TEAM (Turbo Erlang Abstract Machine)

It soon became clear that Erlang still needed to be faster to be useful in real projects.

Therefore Bogumil (“Bogdan”) Hausman created TEAM (Turbo Erlang Abstract Machine). It compiled the Erlang code to C code, which was then compiled to native code using GCC.

It was significantly faster than JAM for small projects. Unfortunately, compilation was very slow, and the code size of the compiled code was too big to make it useful for large projects.

BEAM (Bogdan’s Erlang Abstract Machine)

Bogumil Hausman's next machine was called BEAM (Bogdan's Erlang Abstract Machine). It was a hybrid machine that could execute both native code and threaded code with an interpreter. That allowed customers to compile their time-critical modules to native code and all other modules to threaded BEAM code. The threaded BEAM in itself was faster than JAM code.

Bogdan’s original compiler for BEAM shared the compiler front end with JAM. Essentially, the front end at that time did the same thing as the front end in the current compiler as described in Lost in Translation (Exploring the Compiler’s Front End).

I don't have the source code for Bogdan's original compiler, but as far as I can determine it had three compiler passes that translated the abstract format to threaded BEAM code.

  • beam_compile - Translated the abstract format to BEAM instructions.

  • beam_optimize - Optimized the BEAM instructions. This pass was mandatory, since it did some necessary transformations of the BEAM instructions.

  • beam_asm - Converted the symbolic BEAM assembly format to a binary BEAM module.

VEE (Virding’s Erlang Engine)

Here we must mention VEE (Virding’s Erlang Engine) for reasons that will soon become clear.

VEE was an experimental implementation with a different memory model compared to JAM and BEAM. Instead of JAM's and BEAM's separate heaps for each process, VEE used a single shared heap with a real-time garbage collector. That made message passing blindingly fast compared to JAM and BEAM.

Overall, though, there was no speed gain compared to JAM. The reason was probably that the single shared heap decreased the cache hit rate.

The maturation of BEAM

The OTP group and Erlang/OTP were created to industrialize Erlang and make it suitable for huge real-world projects. The first release, OTP R1B, was released in 1996.

This is the point where the history lesson may become a little bit more subjective.

I joined the Erlang/OTP team at the end of 1996. My first small code contributions to Erlang/OTP were included in OTP R1D.

I worked in the ERTS (Erlang Run-Time System) team, which at that time was led by Kenneth Lundin. Initially I worked on the Erlang runtime system for Microsoft Windows. After some time (maybe a year or so), Kenneth asked me to help stabilize and improve BEAM. Gradually BEAM became my main responsibility, and when Bogdan left Ericsson, I became the main developer responsible for the BEAM interpreter and compiler.

This blog post desperately tries to cover the history of the BEAM compiler, but I think that some more historical context is needed before we can approach the compiler.

The overall goal of the work on BEAM from OTP R1 up to OTP R5 was to make it stable enough and fast enough to be useful in real projects.

There were two major obstacles to reaching that goal:

  • BEAM/C, that is, native code via C code.
  • The huge number of ever-changing BEAM instructions.

BEAM/C must die!

It soon became obvious that BEAM/C, the compiler passes that compiled Erlang code to C code, had to die. At the time that I started working on BEAM, there were three distinct flavors of BEAM/C: one for GCC on Sparc, one for GCC on non-Sparc CPUs (such as Intel x86), and one for other C compilers that did not support GCC's extension for taking the address of a label. Bugs not only showed up in the native code, but the mere existence of BEAM/C complicated and caused bugs in the threaded BEAM interpreter.

Unfortunately, early in my career of improving BEAM, I made some optimizations of the size of the C code generated by BEAM/C. That came back to bite me later when I suggested that we should remove BEAM/C. The size improvements made it possible to fit more Erlang code compiled to native code into the system, and the native code was faster than threaded BEAM code. Our customer at the time (the AXD 301 project) needed the extra speed improvements that BEAM/C gave them and did not allow us to remove BEAM/C unless we could improve the performance of threaded BEAM code to similar or better than BEAM/C performance.

The ever-changing BEAM instructions

At that time, the BEAM interpreter had over 300 instructions. While JAM had a very simple loader that essentially only loaded the JAM files into memory, the loader for BEAM had to translate every instruction from the byte format in the BEAM files to the threaded code format in memory. The BEAM had hand-written code for the loading of every single instruction.

To make it worse, the instruction set was constantly evolving. Bug fixes and performance improvements needed new instructions, and those instructions had to be implemented in the compiler, threaded code interpreter (the process_main() function in beam_emu.c), and the loader. In every minor and major release of Erlang/OTP, the users of BEAM had to recompile all of their Erlang code because the instruction set had changed.

There must be a better way, I thought. I started to write a simple Perl script to at least automate the mapping from instruction name to instruction number in the compiler, interpreter, and loader. Tony Rogvall suggested that I could be more ambitious and generate most of the code for the loader using the Perl script. He also suggested that operands for many instructions could be packed into a single word. That would reduce loaded code size and also improve the cache hit rate, improving execution speed.

So I started writing the first version of the beam_makeops script and rewriting the loader. I prefer to work incrementally, making minor changes to a code base that is always working. But I could not rewrite the loader incrementally, so I hacked away frantically for two or three days until I had a bare bones version of the new loader working. I could then relax a little and somewhat more slowly add more features to beam_makeops and the loader.

The new loader took over some tasks formerly done by the compiler.

For example, the BEAM machine has several specialized move instructions. There is one instruction for moving something into an X register, another for moving an atom into an X register, and so on. Before the new loader, the compiler knew about all those variants of move instructions and selected the appropriate one. With the new loader, there is only one move instruction that the compiler needs to care about, and the loader will select the appropriate specialized move instruction to use at load time.

Another minor optimization done by the compiler was combining common instruction sequences. For example, a move instruction followed by a call instruction would be combined into a move_call instruction. That optimization was also moved to the loader.

All those capabilities made it possible to significantly simplify and reduce the number of instructions known to the compiler. More importantly, that made it possible to keep the instruction set stable (while still allowing minor optimizations and performance tuning by tweaking only the loader and interpreter), avoiding the need to recompile all Erlang code every time there was a new release.

If my memory doesn’t fail me, the new loader was introduced in OTP R4.

OTP R5B: The “new” BEAM

Moving forward to OTP R5.

OTP R5 was the last release that supported JAM.

OTP R5 can also be said to be first release that featured the “new” BEAM. In that release, the modern BEAM file format was introduced. The same file format is used today. At that time, there were 78 BEAM instructions; in OTP 20, there are 159 instructions (actually, 129 active instructions and 30 obsoleted instructions no longer used). While new instructions have been introduced when needed and obsolete instructions have been removed, it has always been possible to load BEAM files compiled from at least two major releases back.

Execution of threaded BEAM had become fast enough, so that BEAM/C could be dropped (already in R4, I think). But strangely enough, the customers still wanted more speed.

The BEAM compiler in R5 was still Bogdan’s original compiler. While it did more optimizations than the JAM ever did, we knew that more optimizations were possible.

R6B: Enter Kernel Erlang

Meanwhile, on the top floor Robert Virding was busy writing a new compiler for his VEE machine. In that new compiler, Robert introduced a new intermediate format that he called Kernel Erlang. The idea was that more optimizations could be applied to the code in that format before generating code for the actual machine.

At that time, there was no actual interpreter that could execute the code emitted by his new compiler (he had not updated the VEE machine yet). The machine he had in mind was a register machine. It was similar to BEAM, except that it did stack trimming.

We wanted the better performance that we could get from Robert’s compiler, but the question was: should we implement a new interpreter (or adapt BEAM) to execute the code from Robert’s compiler, or should we adapt Robert’s compiler to generate BEAM code?

Because we now for the first time had a stable implementation of BEAM, we decided not to rock the boat again; thus, we decided that I should adapt the code generator part of Robert’s compiler for BEAM.

For the most part, I used Robert's names for instructions. For example, the instruction to load a term into a register was called M in the original BEAM, while Robert's compiler used move. The more major changes were in the handling of the stack. Robert's compiler had stack trimming, which I had to remove and rewrite to handle BEAM's fixed stack frame. (I reintroduced a limited form of stack trimming later.)

Since JAM was not supported in OTP R6, all customers that had previously used JAM had to migrate to BEAM. To minimize the risk of the migration as much as possible, one of our customers requested that we make the battle-tested original BEAM compiler available as an option in OTP R6.

Therefore, we added options to choose which version of the compiler to use. To use the old compiler, one would write:

$ erlc +v1 some_module.erl

The default was Robert's new compiler, which was called v2. There was also an undocumented, unofficial compiler version called v3.

All compilers shared the front end and the beam_asm pass that created the final BEAM module.

The v1 compiler

The v1 compiler had the following passes:

  • v1_adapt
  • v1_compile
  • v1_optimize
  • v1_cleanup

The v1_compile and v1_optimize passes were essentially the beam_compile and beam_optimize passes from Bogdan’s compiler.

There had been some changes to the front end since R5, so the v1_adapt pass was there to hide those changes for the v1_compile and v1_optimize passes. The v1_cleanup pass was an additional minor optimization pass; I think it was present in OTP R5 as well.

The v2 compiler

The v2 compiler was Robert’s new compiler. It had the following passes:

  • v2_kernel
  • v2_kernopt
  • v2_match
  • v2_life
  • v2_codegen

The v2_kernel pass translated the abstract format to Kernel Erlang.

v2_kernopt did very basic optimizations of the Kernel Erlang code, essentially only constant propagation and constant folding.

v2_match did pattern matching compilation. JAM would match clauses in function heads or case expressions sequentially. The old BEAM compiler would do only a little bit better in that it could match multiple integers or atoms in a single instruction. Robert’s compiler was the first Erlang compiler to properly compile pattern matching using the algorithm described in The Implementation of Functional Programming Languages by Simon Peyton Jones.

v2_life would calculate life-time information needed by the v2_codegen pass, and v2_codegen would generate the BEAM assembly code.

R7B: Enter Core Erlang

Meanwhile, Richard Carlsson and the HiPE group at Uppsala University came up with the idea for a new intermediate format, useful as an interchange format for different Erlang implementations and for optimizing Erlang programs.

The new format was called Core Erlang. Robert liked the idea and started to implement Core Erlang in the compiler. The undocumented implementation of the v3 compiler in OTP R6 is based on a draft version of the Core Erlang specification.

In OTP R7B, the v1 and v2 compilers were removed, and the only remaining compiler was the v3 compiler that used Core Erlang. It had the following passes:

  • v3_core
  • v3_core_opt
  • v3_kernel
  • v3_life
  • v3_codegen

The v3_core pass translated the abstract format to Core Erlang.

The v3_core_opt pass essentially only called sys_core_fold, which did constant propagation and constant folding. sys_core_fold still does those things, and more.

The remaining passes do the same thing as today.

The v3_kernel pass translates from Core Erlang to Kernel Erlang, and also does pattern matching compilation (in the same way as in v2_match). The optimizations in v2_kernopt are now done in sys_core_fold.

The v3_life pass (despite its name) no longer calculates life-time information. The life-time information is instead calculated by v3_kernel and passed on as annotations.

The reason that v3_life still exists is that Robert had continued to work on his own version of codegen that did not have all my changes in it to work for BEAM. While implementing the Core Erlang passes, he also did many improvements to codegen.

When it was time to integrate our different versions of the compiler, Robert looked in horror at all my changes in codegen. To avoid having to reintroduce all my adaptations and optimizations for BEAM into his new version of codegen, Robert wrote an adapter pass that translated from the new Kernel Erlang format to the old format so that my codegen would work. That adapter pass is called v3_life.

Thus, v3_codegen is essentially v2_codegen with a new name.

In the upcoming OTP 21, v3_life has been combined with v3_codegen.

Learning Erlang from Robert

In the time period that Robert and I worked together on the compiler, I usually worked on v3_codegen and the passes below, while Robert worked on all passes above v3_codegen.

Occasionally, I would add some optimizations to sys_core_fold and give them to Robert to incorporate into his latest version of sys_core_fold.

I would then look at what Robert had done with my code, and learn.

Usually Robert had subtly improved my code, making it slightly cleaner and simpler. But one time I handed Robert an optimization of case clauses. The code I got back was very different. Robert had broken apart my optimization into several simpler optimizations that together achieved the same purpose as my more complicated optimization (and more).


Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.