All For Reliability: Reflections on the Erlang Thesis

If you ask Elixir developers what got us interested in the language, many will say “concurrency.” We wanted to make our programs faster by making them use more cores.

That desire is a big part of why Elixir exists. Before creating Elixir, José Valim was trying to make Rails thread-safe, and found it frustrating. He remembered thinking:

“If the future is going to have [many] cores, we needed to have better abstractions, because the ones I had working with Ruby and Rails were not going to cut it. So I decided to study other languages and see what they were doing.”

That study led him to Erlang, and he built Elixir to run on the Erlang virtual machine.

Similarly, Chris McCord wanted to use WebSockets in his Rails apps, but struggled to create a scalable solution in Ruby. Then he heard about WhatsApp using Erlang to get 1 million connections on a single machine.

“That kind of blew my mind, because I was looking at getting maybe a hundred connections on my Rails app.”

That got McCord interested in Erlang, then Elixir. He went on to create Elixir’s main web framework, Phoenix, known for sub-millisecond response times and massively scalable WebSocket support.

Personally, I worked on a travel search engine that needed to do a lot of on-demand concurrent work, but was written in a language that made that difficult. I learned about Elixir while at that job, and though we couldn’t adopt it there, I knew it was a tool I’d want to use in the future.

But although he calls it a “concurrency-oriented programming language”, Erlang co-creator Joe Armstrong does not cite speed as the main reason for its creation in his 2003 PhD dissertation, “Making reliable distributed systems in the presence of software errors”. Its purpose is there in the title: reliability.

What’s fascinating is how many of Erlang’s (and therefore Elixir’s) attributes are direct consequences of designing it to be reliable.

Mistakes Are Inevitable

In the paper, Armstrong talks about the challenges his team faced in writing telephone switching systems in the 80s.

First, they needed to write complex systems, with “several millions of lines of program code”, and might have teams with “many hundreds of programmers” of varying experience levels.

The requirements were demanding:

Market pressure encourages the development and deployment of systems with large numbers of complex features. Often systems are deployed before the interaction between such features is well understood. During the lifetime of a system, the feature set will probably be changed and extended in many ways.

Under such constraints, there was no way they could produce perfect software.

Yet something approaching perfection was required. Telephone calls are important; some of them are emergencies. You can’t turn off the phone system each night for maintenance. A telephone system needs “to be in continuous operation for many years”, “typically having less than two hours of down-time in 40 years.”

It’s quite a problem! How can an imperfectly-written system get near-perfect results? Or as Armstrong puts it:

The central problem addressed by this thesis is the problem of constructing reliable systems from programs which themselves may contain errors.

Handling the Worst Case

Before solving the problem, Armstrong makes it harder by considering the worst case: hardware failure. Imagine a perfectly-written program running on a computer that bursts into flames.

There are a lot of failures from which a process can recover, but ceasing to exist is not one of them.

Armstrong addresses this situation simply.

To guard against the failure of an entire computer, we need two computers.

Specifically:

If the first computer crashes, the failure is noticed by the second computer, which will try to correct the error.

This might mean, for example, creating a new process elsewhere to replace the one that died.

Here we see the first interesting property of Erlang: it’s made to run systems that span multiple machines, because that’s what’s needed for reliability.

Even better: since the failure of one process must be corrected by another process when they’re on separate machines, Erlang uses that mechanism for all failures. All failures are handled via “monitors” and “links”, which are ways for one process to react to the failure of another, supported directly by the VM. (These mechanisms are the foundation for the supervision tools and patterns of OTP.)

In fact, the Erlang VM goes so far as to make hardware failures look like software failures. If Process A is monitoring Process B and Process B dies, Process A is notified. This happens whether Process B divides by zero, loses connectivity, or dies in a fire; in the last two cases, the “distress call” is faked by the VM when it loses contact with Process B.

As Armstrong says:

The reason for coercing the hardware error to make it look like a software error is that we don’t want to have two different methods for dealing with errors… we want one uniform mechanism. This, combined with the extreme case of hardware error, and the failure of entire processors, leads to the idea of handling errors, not where they occurred, but at some other place in the system.

Process Properties

Now consider the two processes mentioned above, running on separate machines. There are certain things we can be sure of.

  • They can’t share memory because they’re physically separated. This is nice, because a failing process can’t corrupt the memory of the process that’s watching it.
  • They can communicate only by passing messages.
  • They will succeed or fail independently.

This separation is analogous to a firewall. In construction, a “firewall” is a wall between two parts of a building that keeps fire from spreading. A maximally-safe house would have firewalls around every room, chair, and bed. In a house, this is impractical. But in an Erlang system, process boundaries are like firewalls for failures, and they cost almost nothing.

And again, for uniformity, these same “firewall-like” characteristics are true whether two processes run on separate machines or not. Processes on the same machine share no memory and communicate only by passing messages, just as if they were on separate machines.

( As Armstrong noted, Erlang’s process isolation isn’t perfect; if one process allocates an inordinate amount of memory or atoms, for example, it could crash the Erlang VM on that machine, including all its processes. )

To further improve reliability, several more properties are needed.

First, messages must be asynchronous. As Armstrong says:

Isolation implies that message passing is asynchronous. If process communication is synchronous then a software error in the receiver of a message could indefinitely block the sender of the message, destroying the property of isolation.

Second, processes must be lightweight. If safety is increased by dividing a system into more processes, we’ll want to run a lot of them, creating them quickly and on-demand.

And third, processes must take turns. As Armstrong says:

Concurrent processes must also time-share the CPU in some reasonable manner, so that CPU-bound processes do not monopolise the CPU, and prevent progress of other processes which are “ready to run.”

Like most operating systems, the Erlang VM uses “preemptive multitasking”. This means that each process gets a fixed amount of time to use the CPU. If a process isn’t finished when its turn is up, it is paused and sent to the back of the line, then another process gets a turn. It’s also paused if it’s waiting to read a file or get a network response.

In this way, the Erlang VM supports “non-blocking IO” as well as “non-blocking computation”, both of which get applied automatically to sequential-looking code.

Where Reliability and Performance Meet

You might have noticed that those last two points of reliability are also performance concerns. That’s because the two are related.

Imagine a system that’s handling a large number of small tasks - calls, web requests, or whatever. In comes a large task. What happens?

If the system’s overall performance takes a dive, it’s not reliable. Calls get dropped, web requests time out, and so on. A reliable system must perform consistently.

This is the logic behind the Erlang VM’s multitasking. In absolute terms, frequent task-switching wastes a little time, making performance sub-optimal. But the benefit is that performance is consistent: small tasks continue completing quickly, while large tasks get processed a little at a time.

Garbage collection works this way, too. Although Erlang’s immutable data means that a lot of garbage is created, it’s divided into tiny heaps across many processes. When a process dies, its memory is freed, and GC isn’t needed. For long-running processes, GC is performed concurrently. There are no “stop the world” pauses to collect garbage from the entire system, so GC is no barrier to consistent performance.

Evidence for Reliability

After all this effort toward building reliable systems, Armstrong tried to find out how well several Erlang-based systems had worked; he interviewed the maintainers, analyzed the source code, and examined bug reports.

Armstrong has elsewhere cited Ericsson’s ADX301 switch as an example of “nine nines” reliability - an uptime of 99.9999999%.

Here’s how he describes the switch:

The ADX301 is designed for “carrier-class” non-stop operation. The system has duplicated hardware which provides hardware redundancy and hardware can be added or removed from the system without interrupting services. The software has to be able to cope with hardware and software failures. Since the system is designed for non-stop operation, it must be possible to change the software without disturbing traffic in the system.

In his thesis, he treats the “nine nines” figure as uncertain:

Evidence for the long-term operational stability of the system had also not been collected in any systematic way. For the Ericsson AXD301 the only information on the long-term stability of the system came from a PowerPoint presentation showing some figures claiming that a major customer had run an 11 node system with a 99.9999999% reliability, though how these figure had been obtained was not documented.

And that particular figure has been the subject of some debate.

However, Armstrong’s investigations seemed to indicate that the Erlang systems he examined were indeed extremely reliable:

The software systems in the case study are so reliable that the people operating these systems are inclined to think that they are error-free. This is not the case, indeed software errors did occur at run-time but these errors were quickly corrected so that nobody ever noticed that the errors had occurred.

Performance After All

So here’s a recap of what the Erlang VM gives us in pursuit of reliability:

  • Lightweight, shared-nothing processes that communicate via asynchronous messages
  • A built-in mechanism for processes to react to failures in other processes
  • The ability to quickly spawn a large number of processes and run them on multiple machines
  • Efficient context-switching and concurrent garbage collection

No wonder Armstrong calls Erlang a “concurrency-oriented programming language”. These features, created mainly for reliability, make it easy to write concurrent programs that scale horizontally across cores and clusters. You write your program using processes and messages, and the VM takes care of all the tricky parts - running a scheduler per core, giving each process its turn, moving processes between schedulers to balance throughput and power consumption, and so on.

Having tiny processes makes this easier. Armstrong once described the ease of scheduling small Erlang processes vs larger operating system processes:

Packing huge big rocks into containers is very very difficult, but pouring sand into containers is really easy. If you think of processes like little grains of sand and you think of schedulers like big barrels that you have to fill up, filling your barrels up with sand, you can pack them very nicely, you just pour them in and it will work.

And of course, the Erlang VM is pretty fast. As Armstrong once joked in a conference talk:

You take a program and you want it to go a thousand times faster, just wait ten years, and it goes a thousand times faster. So that’s solved. If you want it a million times faster, you wait twenty years. So that problem’s solved. And that’s why Erlang’s really fast today, because we waited a long time.

Erlang was created in the 1980s for telephones switches that handled “many tens of thousands of people” interacting simultaneously. As Armstrong implied, it rode the wave of ever-increasing chip speeds even as it continued to be optimized, so that its once-acceptable speed became truly formidable.

That wave has passed, and the future of performance is concurrency. To take advantage of it, we have to write concurrent programs. If Armstrong’s thesis is correct, “concurrency-oriented programming” is also the key to reliability.

And it’s awfully nice to get both at the same time.

Permalink

Erlang: Getting Started Without Melting

There are two things that might be meant when someone references “Erlang”: the language, and the environment (the EVM/BEAM and OTP). The first one, the language part, is actually super simple and quick to learn. The much larger, deeper part is learning what the BEAM does and how OTP makes your programs better.

It is clear that without an understanding of Erlang we’re not going to get very far in terms of understanding OTP and won’t be skilled enough to reliably interact with the runtime through a shell. So let’s forget about the runtime and OTP for a bit and just aim at the lowest, most common beginners’ task in coding: writing a script that tells me “Hello, World!” and shows whatever arguments I pass to it from the command line:

#! /usr/bin/env escript

% Example of an escript
-mode(compile).

main(Args) ->
    ok = io:setopts([{encoding, unicode}]),
    ok = io:format("Hello, world!~n"),
    io:format("I received the args: ~tp~n", [Args]).

Let’s save that in a file called e_script, run the command chmod +x e_script to make it executable, and take a look at how this works:

ceverett@takoyaki:~$ ./e_script foo bar
Hello, world!
I received the args: ["foo","bar"]
ceverett@takoyaki:~$

Cool! So it actually works. I can see a few things already:

  1. I need to know how to call some things from the standard library to make stuff work, like io:format/2
  2. io:setopts([{encoding, unicode}]) seems to makes it OK to print UTF-8 characters to the terminal in a script
  3. An escript starts execution with a traditional main/1 function call

Some questions I might have include how or why we use the = for both assignment and assertion in Erlang, what the mantra “crash fast” really means, what keywords are reserved, and other issues which are covered in the Reference Manual (which is surprisingly small and quick to read and reference).

An issue some newcomers encounter is that navigating an unfamiliar set of documentation can be hard. Here are the most important links you will need to know to get familiar and do useful things with the sequential language:

This is a short list, but it is the most common links you’ll want to know how to find. It is also easy to pull up any given module for doing a search for “erlang [module name]” on any search engine. (Really, any of them.)

In the rare case that erlang.org is having a hard time I maintain a mirror of the docs for various Erlang release versions here as well: http://zxq9.com/erlang/

Start messing with sequential Erlang. Don’t worry about being fancy and massively concurrent or maximizing parallelization or whatever — just mess around at first and get a feel for the language using escript. It is a lot of fun and makes getting into the more fully encompassing instructional material much more comfortable.

Permalink

On Graphvix - Postscript

I had intended to end this blog series with the previous post, but my own personal yak-shaving led to wanting to add one more feature (for now) to Graphvix, and so I wanted to write it up for any readers I might have, and also for myself. This feature is HTML Records, which allow the user to create table-style records (as we saw in parts 7 and 8) using HTML <table> markup.

To start off, let’s take a look at what a record like this might look like in DOT notation:

html_record [label=<<table>
  <tr>
    <td>a</td>
    <td>b</td>
  </tr>
  <tr>
    <td>c</td>
    <td>d</td>
  </tr>
</table>>,shape="plaintext"]

And here’s what that looks like

basic.png

And something a little more complicated:

html_record [label=<<table border="0" cellborder="1" cellspacing="0">
  <tr>
    <td rowspan="3" color="blue"><font point-size="24" color="#ff0000">hello</font><br/>world</td>
    <td colspan="3">b</td>
    <td rowspan="3" border="3">g</td>
    <td rowspan="3">h</td>
  </tr>
  <tr>
    <td>c</td>
    <td port="here">d</td>
    <td bgcolor="green">e</td>
  </tr>
  <tr>
    <td colspan="3" cellpadding="10">f</td>
  </tr>
</table>>,shape="plaintext"]

which produces the resulting output.

complex.png

You can go even further here. It is possible to nest tables inside one another, to change font, to explicitly set text alignment, to have background colors form a gradient, and more. However, let us begin by writing an API that will let us generate the first, simple table shown above. Once we have done that, much like more generic nodes and edges in earlier posts in this series, we will see that creating more complex examples requires little more than passing additional parameters to set as attributes, not a large expansion of the API.

Graphvix API

Since this is a different type of record than the row and column based ones we’ve looked at before, it makes sense to begin by writing a new module to contain the logic we need:

defmodule Graphvix.HTMLRecord do
  defstruct [
    rows: [],
    attributes: []
  ]

  def new(rows, attributes \\ []) when is_list(rows) do
    %__MODULE__{rows: rows, attributes: attributes}
  end
end

Simple enough. And no harder to start building the API for adding rows and cells:

defmodule Graphvix.HTMLRecord do
  # in the Graphviz HTML record notation `tr`s cannot have attributes,
  # hence the lack of a keyword list here
  def tr(cells) when is_list(cells) do
    %{cells: cells}
  end

  def td(label, attributes \\ []) do
    %{label: label, attributes: attributes}
  end
end

Great! Now we can build out a full HTMLRecord. Let’s start by creating our first, simple example from the beginning of the post:

alias Graphvix.HTMLRecord
import HTMLRecord, only: [tr: 1, td: 1, td: 2]

record = HTMLRecord.new([
  tr([
    td("a"),
    td("b")
  ]),
  tr([
    td("c"),
    td("d")
  ])
])

And it turns out, in order to implement our more complex example using this API, we only need to add two more functions, font/2 and br/0

defmodule Graphvix.HTMLRecord do
  ...
  # We'll add the `tag: "font"` element here to differentiate
  # from a `td` tag
  def font(label, attributes \\ []) do
    %{label: label, attributes: attributes, tag: :font}
  end

  def br, do: %{tag: :br}
end

And then, in Elixir:

HTMLRecord.new([
  tr([
    td([
      font("hello", point_size: 24, color: "#ff0000"),
      br(),
      "world"
    ], rowspan: 3, color: "blue"),
    td("b", colspan: 3),
    td("g", rowspan: 3, border: 3),
    td("h", rowspan: 3)
  ]),
  tr([
    td("c"),
    td("d", port: "here"),
    td("e", bgcolor: "green")
  ]),
  tr([
    td("f", colspan: 3, cellpadding: 10)
  ])
], border: 0, cellborder: 1, cellspacing: 0)

.to_label/1

Now that we have our API worked out, we need one final piece: converting it all to proper .dot notation. As with the records we looked at previously, our task is to convert an HTMLRecord into label text and attributes that can be fed into our existing functions for adding vertices to a graph. Since a table is a series of tr elements, which are each themselves a series of td elements, we can sketch out the basic shape of our .to_label function and its helper functions:

defmodule Graphvix.HTMLRecord do
  ...
  def to_label(%__MODULE__{rows: rows, attributes: attributes}) do
    # convert all the rows to their string representation, and
    # wrap those in a `<table>` declaration + attributes
  end

  defp tr_to_label(%{cells: cells}) do
    # convert each cell to its string representation, and
    # wrap those in a `<tr>` declaration
  end

  defp td_to_label(%{label: label, attributes: attributes}) do
    # convert the label to its string representation, and
    # wrap it in a `<td>` declaration + attributes
  end

  defp attributes_for_label(attributes) do
    # helper method to convert a Keyword list of attributes
    # into a properly formatted string.
  end
end

Now let’s flesh these out. Because each function depends on those defined after it, let’s start at the bottom with the smallest piece.

attributes_for_label/1

This function takes a keyword list of attribute names and values, and returns them in a string formatted as HTML tag attributes:

key1=“value1” key2=“value2”

There are two possibilities for the argument attributes passed to this function

  1. it is an empty list, in which case we want to return an empty string
defp attributes_for_label([]), do: ""
  1. it is not an empty list, in which case we need to map over the key/value pairs, format them, and join them together.
defp attributes_for_label(attributes) do
  attr_str = attributes
  		|> Enum.map(fn {key, value} -> ~s(#{key}="#{value}") end)
      |> Enum.join(" ")
  " " <> attr_string
end

If you’re not familiar with it ~s(), also called sigil_s is a function that returns a string, allowing interpolation, but providing an alternative to " or ' as the delimiters, allowing those characters to be used inside the string without escaping them.

You’ll also notice that we return the attribute string with an empty space prepended. Since we will be interpolating the return value of this function into a string describing an HTML tag, we want there to be a space between the tag and the attributes if there are any attributes, but no space between the tag and the closing > otherwise. A quick example may illustrate this better than words:

"<table#{attributes_for_label()}>" #=> <table>
"<table#{attributes_for_label(border: 0)}>" #=> <table border="0">

Next we move on to the next function up the hierarchy:

td_to_label/1

The basic shape of this function, as we defined it above, is straightforward:

def td_to_label(%{label: label, attributes: attributes}) do
  [
    "<td#{attributes_for_label(attributes)}>",
    label_to_string(label),
    "</td>"
  ] |> Enum.join("") |> indent()
end

indent/1 is a small helper function I wrote to, well, indent a string (or indent each line in a multiline string).[1] While this is not necessary, it provides cleaner output that is easier to parse, and looks much nicer when written to a file.

The trickiest part here is in label_to_string/1, since it the label argument can be one of

  1. a string
  2. a list of mixed elements (string, <br> tags, and <font tags)
  3. a whole other HTMLRecord struct

We will return to label_to_string/1 shortly, but first let’s finish working our way up the function chain until we reach the top-level HTMLRecord.to_label/1. With that in mind, onward to:

tr_to_label/1

This is possibly the simplest of the functions we need to write here. Since <tr> tags do not have attributes, and every <td> tag it contains is generated by td_to_label/1, the entire function definition looks like this:

def tr_to_label(%{cells: cells}) do
  [
    "<tr>",
    Enum.map(cells, &td_to_label/1),
    "</tr>"
  ] |> List.flatten |> Enum.join("\n")

Since td_to_label/1 calls indent on the <td> tag it generates, the output from this function would look something like:

<tr>
  <td>a</td>
  <td>b</td>
</tr>

And that’s all we need to write our top-level HTMLRecord.to_label/1 function. Let’s take a look:

to_label/1

def to_label(%{rows: rows, attributes: attributes}) do
  [
    "<<table#{attributes_for_label(attributes)}>",
    Enum.map(rows, fn row -> row |> tr_to_row() |> indent() end),
    "</table>>"
  ] |> List.flatten |> Enum.join("\n")
end

Because indent/1 knows how to indent each line of a multiline string, we can pass the entire string value of a <tr> tag to it and know that each enclosed <td> tag will also be indented another space. A simple example might look something like this:

<<table>
  <tr>
    <td>a</td>
    <td>b</td>
  </tr>
  <tr>
    <td>c</td>
    <td>d</td>
  </tr>
</table>>

which you might recognize as being the label value for our first example table at the beginning of this post. Look at that!

You may have noticed that the entire table’s HTML is wrapped in a pair of < and >. This is part of the .dot syntax, wrapping an HTML label in brackets instead of double quotes.

With that in hand, let’s jump back to label_to_string/1

label_to_string/1

To refresh, the label we pass to this function can be one of

  1. a string
  2. a list of mixed elements (string, <br> tags, and <font tags)
  3. an HTMLRecord struct

The first case is the simplest to write. If it’s already a string, just return the string unmodified def label_to_string(str) when is_bitstring(str), do: str

If we receive a list, we just want to iterate recursively over each element, until each element is itself a string, and then we can join them and move on

def label_to_string(list) when is_list(list) do
  Enum.map(list, &label_to_string/1) |> Enum.join("")
end

This requires us to define label_to_string for <br/> and <font> tags. Remember we wrote helper functions above to generate these, in which we added a tag key to the map. We can use that now to help pattern match these cases

def label_to_string(%{tag: "br"}), do: "<br/>"
def label_to_string(%{tag: "font", label: label, attributes: attributes}) do
  [
    "<font#{attributes_for_label(attributes)}>",
    label_to_string(label),
    "</font>"
   ] |> Enum.join("")
end

Remember, the label we pass to a font tag could be another list of smaller elements.

Finally, if we’re nesting an entire table inside a cell, we can just reuse HTMLRecord.to_label/1, can’t we?

def label_to_string(table = %HTMLRecord{}) do
  HTMLRecord.to_label(table)
end

Not quite! Remember that to_label currently wraps the entire table in an extra pair of <...>, which we don’t want in this case. I won’t go through an entire code example, but suffice it to say this results in invalid syntax, which results in not displaying the node at all. This is easy enough to fix by using the trick we used previously to determine if a row in a record was at the top level of the label. This requires just a small change to to_label/1:

def to_label(%{rows: rows, attributes: attributes}, top_level \\ true) do
  {extra_opening_tag, extra_closing_tag} = case top_level do
    true -> {"<", ">"}
    false -> {"", ""}
  end
  [
    "#{extra_opening_tag}<table#{attributes_for_label(attributes)}>",
    Enum.map(rows, fn row -> row |> tr_to_row() |> indent() end),
    "</table>#{extra_closing_tag}"
  ] |> List.flatten |> Enum.join("\n")
end

and a small change to label_to_string/1

def label_to_string(table = %HTMLRecord{}) do
  HTMLRecord.to_label(table, false)
end

Now there is only one piece left: adding an HTMLRecord to our graph.

Graphvix.Graph.add_html_record/2

After all the buildup, this turns out to be a rather simple function:

defmodule Graphvix.Graph do
  ...
  def add_html_record(graph, record) do
    label = HTMLRecord.to_label(record)
    attributes = [shape: "plaintext"]
    add_vertex(graph, label, attributes)
  end
end

We take the record, generate the label, and pass that along to our existing add_vertex/3 function. The only additional work we have to do is set the shape attribute of the node to plaintext. This keeps DOT from using the default oval shape and surrounding the table with an oval. Other than that, all other styling attributes are included in the generated HTML.

There’s only one problem left. Our vertices_to_dot/1 function passes the vertex’s label through attributes_to_dot/1, which unconditionally wraps every attribute value in a pair of double quotes. Currently, that includes our HTML label, which is already wrapped in <...>. If we generate a .dot file as is and compile, we end up with a node that looks something like this

error.png

Which is decidedly not what we want. Fortunately, we can add a second pattern match to our attribute_to_dot/2 function which looks for this case and does not include the double quotes:

def attribute_to_dot(:label, value = "<<table" <> _) do
  ~s(label=#{value})
end
# our existing, default implementation
def attribute_to_dot(key, value) do
  ~s(#{key}="#{value}")
end

Recompiling, we get a node that looks much more like what we were hoping for:

basic.png

Edges and ports

As we saw in an example above, giving a cell a port is as simple as passing it is as one of the attributes for td/2. The function add_html_record/2 returns the id of the vertex, along with the graph, so that in conjunction with any port names, can be used to draw edges to and from an HTML table node the same as any other type of node.

Wrapping up

Ok, that was a lot! Thanks for sticking through it with me.

With this post, my series on Graphvix is truly coming to a close. I won’t say it’s the last time I’ll write about this library, as there is additional functionality that could be added, but for now, I’m ready to release v1.0.0 of Graphvix.

It can be found on Github and Hex.pm. Documentation can be found online here.

If you’ve enjoyed this post, or this series more generally, please share it with your friends who might be interested in Elixir or Graphviz. And follow me on Twitter to keep up to date with future blogging.

Permalink

On Graphvix - Part 9

The final1 piece of the Graphvix spec to look at is the ability to rank nodes and subgraphs in order to enforce vertical or horizontal hierarchies in a graph. The functionality to make this happen is actually already in Graphvix, but since it can be such an important piece of structuring a graph, I wanted to call a bit of extra attention to it.

As an example, let’s draw a very simple family tree, with years of birth, of my immediate family. To do this, we need to make sure a timeline of years is ordered correctly along the left side, and that each family member is aligned with their birth year. Let’s begin by creating the vertices:

g = Graph.new
{g, y53} = Graph.add_vertex(g, "1953")
{g, y56} = Graph.add_vertex(g, "1956")
{g, y78} = Graph.add_vertex(g, "1978")
{g, y79} = Graph.add_vertex(g, "1979")
{g, y85} = Graph.add_vertex(g, "1985")
{g, y89} = Graph.add_vertex(g, "1989")
{g, y2015} = Graph.add_vertex(g, "2015")
{g, y2018} = Graph.add_vertex(g, "2018")
{g, michael} = Graph.add_vertex(g, "Michael") # myself
{g, lauren} = Graph.add_vertex(g, "Lauren") # wife
{g, rachel} = Graph.add_vertex(g, "Rachel") # sister
{g, corey} = Graph.add_vertex(g, "Corey") # brother-in-law
{g, howard} = Graph.add_vertex(g, "Howard") # father
{g, jessi} = Graph.add_vertex(g, "Jessi") # mother
{g, h_and_j} = Graph.add_vertex(g, "Howard + Jessi")
{g, l_and_m} = Graph.add_vertex(g, "Lauren + Michael")
{g, r_and_c} = Graph.add_vertex(g, "Rachel + Corey")

Then we create the edges necessary to order the timeline

{g, _} = Graph.add_edge(g, y53, y56)
{g, _} = Graph.add_edge(g, y56, y78)
{g, _} = Graph.add_edge(g, y78, y79)
{g, _} = Graph.add_edge(g, y79, y85)
{g, _} = Graph.add_edge(g, y85, y89)
{g, _} = Graph.add_edge(g, y89, y2015)
{g, _} = Graph.add_edge(g, y2015, y2018)

Now we create a set of subgraphs, each containing a year vertex and the vertex or vertices of births or marriages that occurred in that year. We give each subgraph a property of rank=same to print all the vertices in the subgraph at the same rank:

{g, _} = Graph.add_subgraph(g, [y53, howard], rank: "same")
{g, _} = Graph.add_subgraph(g, [y56, jessi], rank: "same")
{g, _} = Graph.add_subgraph(g, [y78, lauren], rank: "same")
{g, _} = Graph.add_subgraph(g, [y79, h_and_j], rank: "same")
{g, _} = Graph.add_subgraph(g, [y85, michael], rank: "same")
{g, _} = Graph.add_subgraph(g, [y89, rachel, corey], rank: "same")
{g, _} = Graph.add_subgraph(g, [y2015, r_and_c], rank: "same")
{g, _} = Graph.add_subgraph(g, [y2018, l_and_m], rank: "same")

And, finally, we add edges to show relationships. Blue represents a partner in a marriage, and black represents a parent -> child relationship.

{g, _} = Graph.add_edge(g, howard, h_and_j, color: "blue")
{g, _} = Graph.add_edge(g, jessi, h_and_j, color: "blue")
{g, _} = Graph.add_edge(g, h_and_j, michael)
{g, _} = Graph.add_edge(g, h_and_j, rachel)
{g, _} = Graph.add_edge(g, rachel, r_and_c, color: "blue")
{g, _} = Graph.add_edge(g, corey, r_and_c, color: "blue")
{g, _} = Graph.add_edge(g, lauren, l_and_m, color: "blue")
{g, _} = Graph.add_edge(g, michael, l_and_m, color: "blue")

Let’s take a look at the resulting graph:

tree.png

Not the prettiest graph in the world, but it does what we set out to have it do. For the most part, that’s all there really is to ranks in DOT. But, in addition to the rank of same, we can pass in min, max, source or sink. Using our example above, we can see what effect these ranks have.

If a subgraph’s rank is set to min, every vertex in the subgraph will be on at least as low* a rank as any other vertex in the graph.

*In this instance, low is relative to the rank direction of the graph. The default is top-to-bottom, so a lower rank is towards the top of the graph.

For example, if I wanted my own name to be at least as high on the graph as anything else, I could change the line to read

{g, _} = Graph.add_subgraph(g, [y85, michael], rank: "min”)

And the resulting graph looks like

tree_min.png

If I wanted my name to be at a lower rank than any others, I would use source instead of min, resulting in the graph below:

tree_source.png

max and sink, work analogously to min and source respectively, but in the opposite direction, ensuring a rank higher than (or equal to, in the case of max) any other vertex in the graph.

rankdir

As I said above, the default direction for graph ranking is top-to-bottom. This can be set explicitly by adding a global property of rankdir=TB to the graph2

Graph.set_graph_property(g, :rankdir, "TB")

This results in a graph that looks identical to our first graph. As you could probably guess, there are four values for rankdir: TB, BT, LR and RL. These behave as you might expect. For example, here is what our family tree looks like with rankdir=LR

tree_LR.png

And, if we were to set rank=source for my name and birth year, we see, again that the subgraph is shifted to the lowest possible rank, relative to the new rankdir:

tree_LR_source.png

And with that, we’ve reached the end of my series on Graphvix. We started with an overly complex library using GenServers to hew too close to a traditional object oriented approach to the problem, and have ended with a library that not only has more functionality than our original, but also works much more in the style and spirit of Elixir. Thank you for joining me on this journey.

This is far from a complete survey of the DOT language syntax. There are properties for graphs, subgraphs, nodes and edges that I have not touched on in this series. However, they can be added, just as the properties we have looked at, using the API we have written. A more complete overview of the DOT language, which I used as a resource while writing Graphvix and this blog series, can be found as a PDF here.

  1. Not actually the final post. There will be a postscript on HTML-style records following this. 

  2. This function, and the associated functions to display these properties, need to be created at this point, an oversight on my part earlier in the implementation. However, as they have much in common with other functions which already exist, there is little need to describe them in any detail. The Graphvix code containing these additions can be found tagged here

Permalink

On Graphvix - Part 8

In the last post, we looked at how records and ports work in DOT notation, and sketched out an API for incorporating them into Graphvix. In this post, we’ll dive deeper into the Elixir implementation.

What we’ve covered

In the previous post, I wrote out a list of tasks necessary to incorporate records into Graphvix. Let’s revisit that list now and see what we’ve already handled

  1. create a Record struct, with an API to easily create rows and columns
  2. attach port names to cells in a Record struct
  3. generate correct DOT output for nodes with shape=record
  4. generate edge definitions using record ports

It looks like we finished tasks 1 and 2 in Part 7, so our next step is to take one of the Record structs we created and write code to correctly translate it into a DOT string representation.

Record into DOT

Let’s review the function signature of Graph.add_vertex/3:

def add_vertex(graph, label, attributes \\ [])

and the function signature of Record.new/2 is

def new(body, properties \\ [])

In order to convert a Record into a form that can be passed to add_vertex/3, we need to do two things:

  1. Generate the full label from the body attribute of the record node
  2. append [shape: “record] to the record node’s attributes

To make a clean API, let’s define Graph.add_record(graph, record), where the record argument is a Record struct. This function would be responsible for the two steps above, and passing those new values on to Graph.add_vertex/3:

def add_record(graph, %Record{body: body, attributes: attributes}) do
  label = Record.to_label(body)
  attributes = Keyword.put(attributes, :shape, "record")
  add_vertex(graph, label, attributes)
end

We’ve passed the responsibility of generating the label — the most involved part of this functionality — to the Record module, but this is in keeping with the Elixir mindset of keeping each function as simple and self-contained as possible.

Let’s turn our attention to label_from/1 in Record

def to_label(%{body: body}) when is_bitstring(body) do
  body
end
def to_label(%{body: subset = %RecordSubset{}}) do
  RecordSubset.to_label(subset, true)
end

The true we pass into the second definition of to_label is to let our function know that this first subset, whether a row or column, represents is the outer-most orientation in the record. This is important because at the top level, a row has no wrapping around it, while a column at the top level, or any nested row or column, is surrounded by { … }.

Once again, we have deferred functionality down another level, to the RecordSubset module. Again, we have two possibilities we need to handle. First, our subset is a row, and it is at the top level of the record. In that case, we need to generate a DOT string that is not surrounded by braces. Our second condition is any other case, in which case we want to ensure this subset is enclosed in the braces.

def to_label(subset, top_level \\ false)
def to_label(%{cells: cells, is_column: false}, _top_level = true) do
  Enum.map(cells, &_to_label/1) |> Enum.join(" | ")
end
def to_label(%{cells: cells}, _top_level) do
  [
    "{",
    Enum.map(cells, &_to_label/1) |> Enum.join(" | "),
    "}"
  ] |> Enum.join(" ")
end

Once again, deferment. Each cell of a subset can take one of three forms: it can be a string, a tuple represented a cell with a named port, or another subset. To handle these cases, we create the private method _to_label/1:

def _to_label(cell) when is_bitstring(cell), do: cell
def _to_label({port_name, cell}) do
  "<#{port_name}> #{cell}"
end
def _to_label(subset = %RecordSubset{}) do
  RecordSubset.to_label(subset)
end

If the cell is a string, the function needs only to return the string. If the cell has a named port, we interpolate the port name and the cell contents into the correct DOT notation. If the cell is a nested subset, we pass it back to the public function to_label/2, with the second argument (whether or not this subset is at the top level) defaulting to false.

An example

Before moving on to drawing edges to and from records and ports, let’s test out our new functions with a simple example. Using our sample from the end of the previous post, we now have the code written to generate a full DOT file out of it:

g = Graph.new()
{g, v1} = Graph.add_vertex(g, "normal node")
r = Record.new(["a", "b", Record.column([{"c_port", "c"}, "d"])])
{g, v2} = Graph.add_vertex(g, r)
Graph.to_dot(g)
"""
digraph G {

  v0 [label="normal node"]
  v1 [label="a | b | { <c_port> c | d }",shape="record"]

}
"""
Graph.write(g, "test.png")

The output of the final line of code gives us this image:

nodes.png

Edges

The final piece of this work, step 4 from the list at the top of the post, is to be able to generate edges that correctly connect records and ports.

As with add_vertex above, let’s review add_edge

def add_edge(graph, out_from, in_to, atttributes \\ [])

Right now out_from and in_to are the vertex ids generated by add_vertex. The simplest solution would be to pass, instead of a single id, a tuple of {vertex_id, port_name}. Let’s see what happens if we try to pass a tuple like that into :digraph.add_edge directly:

iex> digraph = :digraph.new()
iex> v0 = :digraph.add_vertex(digraph)
iex> v1 = :digraph.add_vertex(digraph)
iex> :digraph.add_edge(digraph, {v0, "port"}, v1)
{:error, {:bad_vertex, {[:"$v" | 0], "port"}}}

Uh oh! Looks like it’s not going to be that simple. What if we construct the full out_from value in correct DOT syntax and pass that in to add_edge?

iex> :digraph.add_edge(digraph, "v0:port", v1)
{:error, {:bad_vertex, "v0:port"}}

The same error! Indeed, looking at the :digraph documentation, it seems that our values for out_from and in_to need to be ids of vertices that already exist in the graph. So, how do we make sure ports, if they exist in the edge definition, are included in the final output?

The solution I came up with is to use the attributes Keyword list to temporarily store values I’ve named outport and inport. We’ll write a slightly modified version of add_edge that sets these values, and update the definition of edges_to_dot to check for, remove, and use these values if they exist. Let’s write some code!

def add_edge(graph, out_from, in_to, attributes \\ [])
def add_edge(graph, {id = [:"$v" | _], port}, in_to, attributes) do
  add_edge(graph, id, in_to, Keyword.put(attributes, :outport, port))
end
def add_edge(graph, out_from, {id = [:"$v" | _], port}, attributes) do
  add_edge(graph, out_from, id, Keyword.put(attributes, :inport, port))
end
def add_edge(graph, out_from, in_to, attributes) do
  eid = :digraph.add_edge(graph.digraph, out_from, in_to, attributes)
  {graph, eid}
end

With a function signature to show the default value for attributes, and our existing function definition at the bottom, this is the complete definition for add_edge. If one or the other of vertex arguments passed in is a tuple with a port name, the function will step through one (or both) of the first two definitions, each time passing the correctly modified values along until we reach the final definition.

Next, let’s look at how these values get used in edges_to_dot

defp edges_to_dot(graph) do
  [_, etab, _] = digraph_tables(graph)
  elements_to_dot(etab, fn edge = {_, [:"$v" | v1], [:"$v" | v2], attributes} ->
    case edge in edges_contained_in_subgraphs(graph) do
      true -> nil
      false ->
        v_out = edge_side_with_port(v1, Keyword.get(attributes, :outport))
        v_in = edge_side_with_port(v2, Keyword.get(attributes, :inport))
        attributes = attributes |> Keyword.delete(:outport) |> Keyword.delete(:inport)
        [
          "#{v_out} -> #{v_in}",
          attributes_to_dot(attributes)
        ] |> compact() |> Enum.join(" ") |> indent()
    end
  end)
end

defp edge_side_with_port(v_id, nil), do: "v#{v_id}"
defp edge_side_with_port(v_id, port), do: "v#{v_id}:#{port}"

elements_to_dot and edges_contained_in_subgraphs we have discussed in previous posts. The important code here is in the false branch of the case statement.

Using our helper method edge_side_with_port/2, we pass in the appropriate vertex id, and the result of looking in our attributes list for the correct port name, and have returned to us the correct syntax for that half of our edge definition, whether or not there is a port name attached.

Then we delete those two keys and their values, if they exist, from the attributes list, to ensure they do not appear in the printed out list of DOT attributes associated with the edge.

Let’s revisit our example from above, and include a couple edges to show this new code working as we expect it to:

g = Graph.new()
{g, v1} = Graph.add_vertex(g, "normal node")
r = Record.new(["a", "b", Record.column([{"c_port", "c"}, "d"])])
{g, v2} = Graph.add_vertex(g, r)
Graph.to_dot(g)
{g, _e1} = Graph.add_edge(g, v1, v2)
{g, _e2} = Graph.add_edge(g, {v2, "c_port"}, v1, color: "green")
"""
digraph G {

  v0 [label="normal node"]
  v1 [label="a | b | { <c_port> c | d }",shape="record"]

  v0 -> v1
  v1:c_port -> v0 [color="green"]

}
"""
Graph.write(g, "test.png")

edges.png

And voila! We can see the green edge pointing out from the “c” cell of our record.

The code up until this point is tagged here: GitHub - mikowitz/graphvix at v1.0.0.pre.records

Between this and the previous post, we’ve covered a lot of ground, both conceptually and in code. There is one final piece of Graphvix to write that lets us manipulate the final layout of a graph, aligning related vertices and subgraphs to create timelines, defined hierarchical structures, etc. That is the concept of rank, which we’ll be exploring in our next, and probably final, post in this series. Thanks for reading this far, and I’ll see you back here for the next post.

Permalink

Scaling reliably during the World’s biggest sports events

Erlang Solutions works with some of the world’s leading online gambling & betting operators. See what we do.

“A busy day for an online betting shop is like Black Friday at Amazon, only prices change every second,” Martin Davies, Head of Technology at bet365 told us a year ago. During big sporting events, like football tournaments, that can go on for weeks with global interest and audience numbers, Amazon’s Black Friday pales in comparison.

Operators need to scale and innovate while serving massive numbers of punters on a daily and hourly basis. Here’s our tips on how to do that:

The innovation-scale dilemma:

the biggest challenge facing online bookmakers

Bookmakers must not only meet the demands of high frequency requests and traffic bursts, but also keep the tech-savvy, multi-channel customers engaged. This is where they come across their biggest challenge; the “innovation-scale dilemma.”

To meet user demand, the industry continuously creates and improves on new types of games and bets. This is when tech departments hit upon the dilemma. Over a certain number of users, the ability to innovate quickly is lost. Precious time is spent rewriting software so that it runs to scale, slowing release time.

To break this vicious cycle, the most innovative industry players – such as bet365 and William Hill – have chosen Erlang (or Elixir, a language based on the same underpinnings as Erlang) as their core technology. Erlang Solutions also supports global payment companies, like Klarna, messaging companies like ooVoo, and IoT companies such as Intel – all of whom face similar technical challenges to online bookmakers.

Read more on SBC News

Innovation at scale

“Innovate or die” is a reality of life for most consumer facing industries these days. As the pace of progress in the computing industry shows no signs of slowing down, disruption by innovative startups is an existential threat to established players in any market.

Online businesses are especially ripe for disruption as the tools and technologies required to start an online business are pretty much available for free with no strings attached. The online gambling and betting industry is no different. Given the recent surge in popularity of online gambling, especially during the World Cup, established players have to innovate just to stay where they are in the marketplace.

Every step of the way, user experience comes first. There is a relentless demand for it to be easier and smoother, regardless of backend challenges. This has left the gambling industry in search of the most efficient and effective ways of directing users to their objective, whether it’s registering, playing a game, placing a bet, cashing out that bet or making a deposit. Systems must be ultra-responsive and able to engage with customers when, where, and how they want.

Those who get this ‘right’ in comparison to their competitors enjoy higher customer engagement and retention, and hence profits. And for each operator, this journey to support customer-centricity through the tech stack is slightly different.

Some established operators have legacy systems that are hard to scale or modernise whilst maintaining uptime and quality of customer experience. Those operators would be wise to take a step back, review their technology choices and evaluate which of the bewildering range of tech, old and new, can best be deployed in helping them achieve these customer-centric goals. It can help to spend some time imagining what an ideal sportsbook architecture would look like, without the burden of legacy, and then mapping out an evolutionary roadmap to get there.

Read more on SBC News

Concurrency, scalability, and reliability

At any one time bet365’s systems are serving many hundreds of thousands of users live odds and results, while managing multiple data streams on the backend. In peak times, like the Super Bowl or the Grand National, the number of users swells by an order of magnitude. This is an awesome feat of concurrent engineering.

When it comes to concurrency, Erlang and Elixir, both built on the BEAM virtual machine, excel. They are both concurrency oriented functional languages, built to handle vast numbers of users at the same time. Users of Erlang across verticals as diverse as telecoms, adtech; financial payments; massive multiplayer online role playing gaming; and social media have all exploited its ability to provide impressive concurrency.

Modern betting infrastructures demand massive scalability, speed, and fault tolerance to run smoothly. Without scalability built into the system, operators can be left to rely on additional hardware to scale vertically, which is not a sustainable nor a cost effective approach.

Erlang and Elixir’s abilities around concurrency go hand-in-hand with their massive scalability. If Erlang isn’t the best tool for every job your system has, a distributed architecture that can plug into Erlang-based products can make your betting system quick to deploy and scale elastically. Tools built in Erlang, such as Riak KV, scale out, up and down predictably. Say goodbye to the headache of emergency hardware and unpredictable system performance under severe load.

Like the stock market, downtime for an online betting operator has immediate financial and reputational consequences. For online sports betting the provision of a ultra-reliable, real-time service is now a priority for bookmakers and punters alike. Just look at the booming in play market for sports events.

In this world, pauses of any kinds, for system failure, garbage collection, or queuing backlogs, are not acceptable. Online betting stacks must handle their constant torrents of data without impacting the system’s processes, or end users.

Erlang and Elixir’s concurrency, no-shared memory architecture and built-in ‘fail and recover’ approach make them behave extremely gracefully and predictably under highly variable stochastic load. In fact, Erlang can support data changing at four times other languages’ rate. This makes Erlang and Elixir ideal to build critical gambling and betting systems on.

Read more on SBC News

The best tool for the job

The software development community is going through a revolution of diversity and inclusion as more developers move away from the biggest programming languages in search of the right tool for the nuanced challenges of our new, technologically-literate world. The result: it’s becoming rarer for there to be de facto programming languages that organisations are forced to use because that’s what their developers community uses.

One-size-fits-all stacks will never scale indefinitely. The most important lesson we’ve learned working in the gaming industry has been realising that a silver bullet for solving every problem simply does not exist, and it’s important to pick a proper tool for the job. Erlang and Elixir are superb at scaling reliably and efficiently when betting operators need it to, but also complement other tools and languages which are better suited to other aspects of a system.

One needs to look no further than bet365, whose Systems Development Manager for Middleware, Andrew Deane, recently went on the record to say “When I joined bet365 4 years ago, the company was in its second wave of Erlang development.

Having proved the model with Erlang, the door was open to explore other languages. Continuing the mantra of the right tool for the right job, we’ve found use cases for Google’s Go in our notifications systems and testing tools and more recently for Elixir in our application tooling.”

Read more on SBC News

Erlang Solutions works with some of the world’s leading online gambling & betting operators. See what we do.

Permalink

Google Summer of Code 2018 projects

Like previous years, the Elixir community is happy to participate in Google Summer of Code 2018. We are currently working on four different projects. Let’s have a look at them.

StreamData integration with typespecs

Student: Nikola Jichev

StreamData is a data-generation and property-based testing library for Elixir. The goal of this GSoC project is to integrate StreamData with typespecs.

The data-generation side of StreamData provides tools to generate random data through composable generators. For example, you could generate keyword lists like this:

import StreamData

keywords_generator = list_of({atom(:alphanumeric), term()})

Enum.take(keywords_generator, 2)
#=> [[_: [true]], [tm: 2, h: %{}]]

In many cases, it would be useful to be able to generate such random data starting from already existing or user-defined types. For example, Elixir already provides a built-in keyword/0 type for keyword lists defined roughly as:

@type keyword() :: [{atom(), any()}]

The goal of the first part of this GSoC project is to provide StreamData with the ability to create data generators from type definitions. The API is not yet defined, but in this case, it could look something like the following:

import StreamData

keywords_generator = from_type(keyword/0)

Enum.take(keywords_generator, 2)
#=> [[_: [true]], [tm: 2, h: %{}]]

In the second part of the GSoC project, the aim is to be able to property-test functions with specs automatically.

@spec has_key?(keyword(), atom()) :: boolean()
def has_key?(keyword, key) do
  # ...
end

The first part of the project focuses on generating data from types, so we know how to generate function arguments. The missing piece is validating that a given term belongs to a given type. For example, in the snippet above, we want to be able to check if a term is a boolean(). Once we’re able to do this, automatic spec validation will be straightforward: it will be a matter of generating random arguments for the given function, calling the function with those arguments, and asserting that the returned value belongs to the return type defined in the spec.

This kind of property-based testing doesn’t test for correctness. In the snippet above, has_key?/2 could be implemented to ignore arguments always return false and the automatic spec validation would pass since false is always a boolean. However, this is a kind of smoke testing useful for discovering inconsistencies in the arguments and return values of functions.

Tensorflex: Tensorflow bindings for the Elixir programming language

Student: Anshuman Chhabra

Currently, there is a lack of machine learning tools and frameworks for Elixir. With the number of programmers learning/using machine learning only set to grow, supporting machine learning capabilities is essential for any programming language. Moreover, there are discussions on ElixirForum regarding this and recent talks given at ElixirConf that reflect the need for Elixir to provide machine learning capabilities.

This project’s goal is Tensorflex, an Elixir machine learning framework similar to Keras for Python. Keras uses Tensorflow as a backend for doing all the machine learning. Tensorflex will use Using Native Implemented Functions (NIFs) and the Tensorflow C API as a backend to provide a low-level API. This low-level API will then be used to write a Keras-like framework in the form of a high-level API. This will allow Elixir developers to write expedient and efficient machine learning code in Elixir.

Dialyzer task for Elixir

Student: Gabriel Gatu

Dialyzer is a discrepancy analyzer that ships as part of Erlang/OTP. Currently, there are two projects that add Dialyzer support to Elixir applications: dialyxir and dialyzex. The goal of this project is to bring ideas from both projects into Elixir itself in order to make using Dialyzer in Elixir projects easier. The task we aim to add to Elixir will focus on two main features: better user experience (in particular, better error messages and formatting) and the ability to analyze projects incrementally.

ElixirBench

Student: Tallys Martins

ElixirBench aims to be a service to monitor performance of Elixir projects. The goal of the GSoC project is to bring ElixirBench up and have it run nightly performance monitoring of significant Elixir projects (including Elixir itself). The end goal is to have a platform that, given a project from GitHub, will monitor the performance of new releases of that project and look for performance regressions. The benchmarking process will be controlled through a configuration file that will specify the benchmark scripts to run.

We have high hopes for this tool as we see value in it for the whole community and for core Elixir projects alike.

Permalink

Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.