How I sped up my XML parser by 15 times

In the previous blog post we have showcased how possible optimizations can be made to binary parsing implementation in Elixir. This article aims to delineate how I have applied these techniques, in order to drastically speed up XML parsing.

Son Doong Cave

XML is one of the most widely used formats for data interchange on the web today. At Forza Football, in addition to JSON and MessagePack, we use XML in several facets of our work: receiving football data from providers, obtaining headlines from RSS feeds. Ensuring that we have an efficent method to parse the data we receive will boost the performance of the application as a whole and thus improve user experience. With speed and usability as my two ultimate goals, I started writing the first implementation of the Saxy project.

Not so “saxy”

After two months of work and two aborted proofs of concept, The first version was published to the world. Since speed was, from the beginning, the desired goal of this project, I decided to do some benchmarks on Saxy against other XML parsing libraries.

Benchmarking with XML is hard because the result highly depends on the complexity of the document, so a relatively simple XML file was picked. The chosen contestants were xmerl—the standard XML parser in Erlang OTP, Erlsom, and fast_xml. But in the end I could not get fast_xml to work as intended, so only xmerl and Erlsom went into the benchmark script.

  Microseconds per run
Saxy 0.3.0 24.4232
xmerl 1.3.16 299.242
Erlsom 1.4.1 65.8912

“Eureka! Eureka! Saxy is 2.7 times faster than erlsom and 12 times faster than xmerl 🎉🎉🎉” I excitedly screamed out.

But after taking some time to reflect and cool down, it turned out that I had a fatal mistake right in the benchmark script. When the script was fixed, reality slapped me really hard in the face.

  Microseconds per run
Saxy 0.3.0 943.401
xmerl 1.3.16 285.5068
Erlsom 1.4.1 66.8736

“Focus on speed … 😏, good that now you know how ignorant you are!”, I told my dispointed self.


I decided to give it another try, but with a different strategy this time. Instead of throwing away yet another proof of concept and writing the new implementation, I spent a few days pondering why the previous parser was so unacceptably slow. What I did was: 1) Study how Erlsom does it (thus learning from the winner) 2) I read the Erlang Efficiency Guide 3) And used our previous blog post as the guideline.

It became clear that there were some shortcomings in my original approach:

Sub-binary creation

Let’s look at a common rule matching function in Saxy.

def match(buffer, position, :element, state) do
    with {:ok, {:<, _token_value}, {new_buffer, new_pos}, new_state} <-
           match(buffer, position, :<, state),
         {:ok, {:Name, tag_name}, {new_buffer, new_pos}, new_state} <-
           match(new_buffer, new_pos, :Name, new_state),
         {:ok, {:SAttribute, attributes}, {new_buffer, new_pos}, new_state} <-
           zero_or_more(new_buffer, new_pos, :SAttribute, new_state, []),
         {:ok, {:S, _s_char}, {new_buffer, new_pos}, new_state} <-
           zero_or_one(new_buffer, new_pos, :S, new_state) do
      {:ok, {:element, tag_name}, {new_buffer, new_pos}, new_state}

A quick summary of what the code does: it tries matching < token, then :Name rule, :SAttribute, and finally :S. For each token/rule that is matched, it returns a four-element tuple representing the matched token/rule, the buffer and its current position, as well as the parsing state.

But why does it make the whole parsing flow slow? The answer is because it completely avoids the compiler from keeping the initial match context, because every match function returns its sub-binary. This topic has been extensively covered in the binary parsing optimization post.

Pattern Matching Mania

In Erlang, pattern-matching in functions, as well as in case and receive are usually optimized.

defmodule Foo do
  def bar([item]), do: item
  def bar({item}), do: item
  def bar([item | rest]), do: item

The generated Erlang VM instructions below will show us how the compiler rearranges the clauses and produces a more optimized code. See here for more information of how to generate these assembler codes.

{:function, :bar, 1, 10,
   {:label, 9},
   {:line, [{:location, 'lib/foo.ex', 2}]},
   {:func_info, {:atom, Foo}, {:atom, :bar}, 1},
   {:label, 10},
   {:test, :is_nonempty_list, {:f, 12}, [x: 0]}, % <—————————————
   {:get_list, {:x, 0}, {:x, 1}, {:x, 2}},
   {:test, :is_nil, {:f, 11}, [x: 2]}, % <—————————————
   {:move, {:x, 1}, {:x, 0}},
   {:label, 11},
   {:move, {:x, 1}, {:x, 0}},
   {:label, 12},
   {:test, :is_tuple, {:f, 9}, [x: 0]}, % <—————————————
   {:test, :test_arity, {:f, 9}, [{:x, 0}, 1]},
   {:get_tuple_element, {:x, 0}, 0, {:x, 0}},

But unfortunately, binary matching is one of the exceptions that the Erlang compiler does not rearrange clauses yet.

defp match_token(<<"-->", _rest::bits>>, :"-->") do
  {:ok, {"-->", 3}}

defp match_token(<<_::bits>>, :"-->") do

defp match_token(<<"-->", _rest::bits>>, :CommentChar) do

defp match_token(<<char::utf8, _rest::bits>>, :CommentChar) do
  {:ok, {<<char::utf8>>, byte_size(<<char::utf8>>)}}

# 50 more token matching.

defp match_token(<<charcode::utf8, _rest::bits>>, :HexChar) do
  if hex_char?(charcode) do
    char = <<charcode::utf8>>
    {:ok, {char, byte_size(char)}}

Pattern matching is absolutely one of the coolest features in Elixir/Erlang and makes writing conditional statements easier than ever, especially for someone who has background from other languages. However, like everything else in the Universe, using it excessively would eventually come back to bite us.

In the code above, every token type is passed as the second argument in the match_token’s clause. In order to match a :HexChar token, the Erlang VM has to try matching every clause of the function (because the first argument is matching binary) in a top-to-toe manner before it can reach the function genuinely doing the matching work. Working in this manner means costing a great deal of overheads and gaining nothing in return.

Binary construction

For better usability, Saxy was designed to emit SAX event data in binary, instead of a characters list, as in other libraries. This makes all the operations afterwards easier for the library users.

I am going to demonstrate how Saxy previously operated:

defp zero_or_more(buffer, position, rule, state, acc) do
  case match(buffer, position, rule, state) do
    {:ok, {^rule, value}, {new_buffer, current_pos}, new_state} ->
      zero_or_more(new_buffer, current_pos, rule, new_state, acc(acc, value))

    {:error, ^rule, {new_buffer, mismatch_pos}, new_state} ->
      {:ok, {rule, acc}, {new_buffer, mismatch_pos}, new_state}

defp acc(acc, value) when is_binary(acc), do: acc <> value

defp match_token(<<charcode::utf8, _rest::bits>>, :NameChar) do
  if name_char?(charcode) do
    char = <<charcode::utf8>>
    {:ok, {char, byte_size(char)}}

At first, I suspected that binary appending operations were creating a bottleneck caused by allocations, but it turned out that the Erlang VM is way smarter than I expected. Unlike binary matching which is optimized by the compiler, the optimization work, primarily to avoid copying, of binary appending is done by the runtime system, therefore the optimization only fails in very few circumstances.

The good thing about open source is that you can learn from any random person on the Internet by looking at their code. I peaked at how other libraries implemented this and learned that Jason took a fairly smart approach. As mentioned, binaries in Erlang are designed in a way that they can be referenced internally, a sub binary is only a reference into a part of another binary. So instead of constructing the result binary, we can slice it out from the original binary with Kernel.binary_part/3.

Make Saxy great again

Taking into account all of these learnings: continuous binary matching, reasonable pattern matching, sub binaries instead of new binary construction, I implemented the next version of Saxy.

def parse_open_tag_name(<<charcode, rest::bits>>, cont, original, pos, state, len)
    when is_name_char(charcode) do
  parse_open_tag_name(rest, cont, original, pos, state, len + 1)

def parse_open_tag_name(<<charcode::utf8, rest::bits>>, cont, original, pos, state, len)
    when is_name_char(charcode) do
  parse_open_tag_name(rest, cont, original, pos, state, len + compute_unicode_len(charcode))

def parse_open_tag_name(<<buffer::bits>>, cont, original, pos, state, len) do
  name = binary_part(original, pos, len)
  state = %{state | stack: [name | state.stack]}
  parse_sattribute(buffer, cont, original, pos + len, state, [])

This implementation has many improvements in comparison to the previous one:

  1. Sub-binaries in each parsing function are now continuously passed to the next without being returned or used. This prevents new ones from being created and keeps the original match context.
  2. Every rule now gets its own function, instead of the rule name being passed as the second argument and the VM does crazy clause evaluations.
  3. Data can be sliced out from the original binary using binary_part/3 without having to be constructed. Some times binary construction cannot be completely avoided, for example mixing character references in XML content "Tom &#x26; Jerry", in that case we can avoid binary appending by building iodata (["Tom ", 0x26, " Jerry"]).
  4. It also improves error handling. For example if the look-ahead code point is not an expected token, we can immediately return a parsing error, instead of having to do a throw/catch block in the previous approach.

And a happy ending

After applying all the optimizations above, I ran the benchmark again (with the correct script 🙈). And the result started to look very promising:

  Microseconds per run
Saxy 0.4.0 64.401
xmerl 1.3.16 285.5068
Erlsom 1.4.1 66.8736

THAT IS 14.7x SPEED UP (yeah I rounded it up a bit in the blog title).

Results with other samples and a proper benchmark library.

Benchmarking against Hackernoon RSS, Saxy is 1.59 times faster than Erlsom.

  IPS 99th %
Saxy 0.4.0 437.79 3.21 ms
Erlsom 1.4.1 275.23 5.03 ms

This speed is particularly noticeable with deeply nested XML4.35 times.

  IPS 99th %
Saxy 0.4.0 15.37 76.18 ms
Erlsom 1.4.1 3.53 294.30 ms

More detailed benchmark results can be found on Github.


I had many different ups and downs, and learned a bunch of things while building the library. I hope that this blog is a fun read and provides a real-world example on optimizing binary parsing.


Exploring the Compiler Using the ‘time’ Option

This is the first of a series of blog posts about the compiler. There will be blog posts about how the compiler works now, how it might work in the future, and some historical notes to explain why some things are what they are. In this blog post I will talk about one of the most useful options for exploring the compiler, namely the time option.

First let see time in action on a huge file with many functions and many variables so that the numbers get interesting:

$ erlc +time NBAP-PDU-Contents.erl
Compiling "NBAP-PDU-Contents"
 remove_file                   :      0.000 s       6.5 kB
 parse_module                  :      0.709 s   25146.1 kB
 transform_module              :      0.000 s   25146.1 kB
 lint_module                   :      0.426 s   25146.1 kB
 expand_records                :      0.086 s   25993.7 kB
 core                          :      0.675 s  282518.3 kB
 sys_core_fold                 :      1.566 s  237885.4 kB
 core_transforms               :      0.000 s  237885.4 kB
 sys_core_bsm                  :      0.205 s  238982.3 kB
 sys_core_dsetel               :      0.108 s  238982.3 kB
 v3_kernel                     :      0.950 s  305320.5 kB
 v3_life                       :      0.453 s  221354.8 kB
 v3_codegen                    :      0.896 s   75801.0 kB
 beam_a                        :      0.080 s   75561.2 kB
 beam_reorder                  :      0.049 s   75561.2 kB
 beam_block                    :      0.361 s   87171.9 kB
 beam_except                   :      0.041 s   81557.7 kB
 beam_bs                       :      0.097 s   79929.2 kB
 beam_type                     :      0.502 s   77270.5 kB
 beam_split                    :      0.042 s   75004.5 kB
 beam_dead                     :      0.356 s   77566.7 kB
 beam_jump                     :      0.232 s   73347.9 kB
 beam_peep                     :      0.164 s   73346.0 kB
 beam_clean                    :      0.150 s   73081.0 kB
 beam_bsm                      :      0.092 s   75473.2 kB
 beam_receive                  :      0.020 s   75473.2 kB
 beam_record                   :      0.023 s   75471.4 kB
 beam_trim                     :      0.042 s   75471.4 kB
 beam_flatten                  :      0.071 s   66745.5 kB
 beam_z                        :      0.019 s   66442.2 kB
 beam_validator                :      0.401 s   66442.2 kB
 beam_asm                      :      0.236 s       6.5 kB
 save_binary                   :      0.000 s       6.5 kB

When the time option is given, the compiler will print a line after executing each compiler pass. First on each line is the name of the compiler pass. Often, but not always, the name is the name of the Erlang module that implements the compiler pass.

The name is followed by the time (in seconds) that the compiler spent running that compiler pass. For smaller files, the time is usually zero or nearly zero. For this huge file, most of the times are non-zero. For example, the sys_core_fold pass needs about one and a half second to do its work.

The time is followed by the amount of memory used by that compiler pass.

In this blog post, I will just talk about a few of the compiler passes. There will be more about what the compiler passes do in later blog posts.

The remove_file pass is the very first pass run. It removes any existing BEAM file so that there will not be an outdated BEAM file in case the compilation fails. The last pass is the save_binary pass. It saves the binary with the BEAM code to the BEAM file.

Now let’s see how the output changes if we give the -S option:

$ erlc -S +time NBAP-PDU-Contents.erl
Compiling "NBAP-PDU-Contents"
 parse_module                  :      0.718 s   25146.1 kB
 transform_module              :      0.000 s   25146.1 kB
 lint_module                   :      0.420 s   25146.1 kB
 expand_records                :      0.088 s   25993.8 kB
 core                          :      0.671 s  282518.3 kB
 sys_core_fold                 :      1.564 s  237885.4 kB
 core_transforms               :      0.000 s  237885.4 kB
 sys_core_bsm                  :      0.203 s  238982.3 kB
 sys_core_dsetel               :      0.104 s  238982.3 kB
 v3_kernel                     :      0.964 s  305320.5 kB
 v3_life                       :      0.375 s  221354.8 kB
 v3_codegen                    :      1.044 s   75801.0 kB
 beam_a                        :      0.091 s   75561.3 kB
 beam_reorder                  :      0.044 s   75561.3 kB
 beam_block                    :      0.276 s   87171.9 kB
 beam_except                   :      0.028 s   81557.8 kB
 beam_bs                       :      0.103 s   79929.3 kB
 beam_type                     :      0.518 s   77270.5 kB
 beam_split                    :      0.049 s   75004.6 kB
 beam_dead                     :      0.379 s   77566.8 kB
 beam_jump                     :      0.195 s   73347.9 kB
 beam_peep                     :      0.156 s   73346.0 kB
 beam_clean                    :      0.168 s   73081.0 kB
 beam_bsm                      :      0.070 s   75473.2 kB
 beam_receive                  :      0.044 s   75473.2 kB
 beam_record                   :      0.021 s   75471.5 kB
 beam_trim                     :      0.041 s   75471.5 kB
 beam_flatten                  :      0.045 s   66745.5 kB
 beam_z                        :      0.016 s   66442.2 kB
 listing                       :      1.503 s   66442.2 kB

We can see how the list of passes has changed. The last pass run is now listing, which produces a listing of the BEAM assembly code in a .S file. The remove_file pass in the beginning is not run because no BEAM file is being produced and any existing BEAM file should be preserved.

Let’s try one of the many undocumented debugging options:

$ erlc +no_postopt +time NBAP-PDU-Contents.erl
Compiling "NBAP-PDU-Contents"
 remove_file                   :      0.000 s       6.5 kB
 parse_module                  :      0.706 s   25146.1 kB
 transform_module              :      0.000 s   25146.1 kB
 lint_module                   :      0.421 s   25146.1 kB
 expand_records                :      0.090 s   25993.8 kB
 core                          :      0.684 s  282518.3 kB
 sys_core_fold                 :      1.614 s  237885.4 kB
 core_transforms               :      0.000 s  237885.4 kB
 sys_core_bsm                  :      0.210 s  238982.3 kB
 sys_core_dsetel               :      0.105 s  238982.3 kB
 v3_kernel                     :      0.967 s  305320.5 kB
 v3_life                       :      0.353 s  221354.8 kB
 v3_codegen                    :      1.028 s   75801.0 kB
 beam_a                        :      0.091 s   75561.3 kB
 beam_clean                    :      0.201 s   73513.2 kB
 beam_z                        :      0.023 s   72897.9 kB
 beam_validator                :      0.467 s   72897.9 kB
 beam_asm                      :      0.396 s       6.6 kB
 save_binary                   :      0.001 s       6.5 kB

We can see that far fewer passes were run. The no_postopt option turns off all optimizations run on the BEAM code (i.e. all optimizations after v3_codegen).

So why is this time option useful?

  • When compilation of a module is very slow, time can show if any particular passes are bottlenecks (much slower than the other passes). In fact, a long time ago the compiler needed several minutes to compile the NBAP-PDU-Contents module that I have used an example in this blog post. The time option immediately pointed out the bottlenecks that I needed to fix.

  • If the compiler doesn’t terminate when compiling a certain module, time will show the last successfully run pass (the one before the culprit).

  • The compiler ignores options it doesn’t recognize, so if you misremember or misspell an option, the compiler will not do what you expect. Adding the time option can help you verify that the expected compiler passes are run.

Where are all those undocumented options documented?

There are many options meant for debugging that allow you skip certain optimization passes or to produce a listing of the code after a certain pass.

Most of these options can be shown by running compile:options/0 from the Erlang shell:

$ erl
Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V9.2  (abort with ^G)
1> compile:options().
dpp - Generate .pp file
'P' - Generate .P source listing file
dabstr - Generate .abstr file
debug_info - Run save_abstract_code
dexp - Generate .expand file
'E' - Generate .E source listing file
dcore - Generate .core file
clint0 - Run core_lint_module
doldinline - Generate .oldinline file
dcorefold - Generate .corefold file
dinline - Generate .inline file
dcopt - Generate .copt file

Points to Ponder

Why does the name of some compiler passes begin with v3? Follow this blog, and there might be an answer in a future blog post.


Erlang: R20.3 doc mirror

The main Erlang website has been super snappy for the last several months so I had slacked off on mirroring the documentation. Today it seems there are a few problems (DDOS-type symptoms, but I have no idea what is going on) so I’ve gone ahead and mirrored the R20 docs.

I also updated the “Erlang Stuff” page — though that page is going to get a few more changes once the Erlang tooling suite I’m working on is out, as now tinder, firekit, flint and zx are now all incorporated into the new (much better) thing… but more on that thing later once I’m done.


I/O polling options in OTP 21

Erlang/OTP 21 will introduce a completely new IO polling implementation. This new implementation comes with a new set of tuneable parameters that can be used to get the most out of your system. This blog post describes the parameters and attempts to describe what they should be used for.

The I/O polling framework in erts is responsible for delivering events to ports and processes that have subscribed to events on file descriptors. Before OTP 21 it was the job of an Erlang scheduler thread to deliver these events. In OTP 21 dedicated threads are used to deliver the events.

For information about how the new implementation works under the hood you can look at Kenneth Lundin’s presentation Erlang VM News Regarding Dirty Schedulers and I/O from the EUC 2017.

Kernel-space vs User-space polling

In OTP 21 the +K option has been removed as it is not longer possible to choose whether to use kernel-space poll or not at run-time. Instead the decision is made at compile time where kernel-space poll will be used by default. If you want to use user-space poll instead you have to pass the --disable-kernel-poll flag to configure when compiling Erlang/OTP.

Before OTP 21 it made sense to run using user-space polling if the file descriptors that was subscribed to tended to be removed quickly. For example if a HTTP server managed short-lived connection from only a handful other machines, it could be beneficial to use user-space poll. However if the connection start being long-lived, or the number of concurrent connection go up, kernel-space poll becomes better.

In OTP 21, this is no longer true. Because the polling has been moved to another thread, it is almost always better to use kernel-space polling. The user-space polling implementation is left in place for platforms that do not support parallel update of the kernel-space pollset. Also user-space polling is used for individual file descriptors when they cannot be put in a kernel-space pollset for some reason.

Poll-threads and Poll-sets

OTP 21 introduces two new configuration parameters: +IOt and +IOp.

Configure +IOt

+IOt controls the number of threads that are used to deliver events. The default is 1 and it should be enough for most applications. However on very busy systems with many concurrent connection it could be beneficial to increase this. One way to get an indication of whether your system could benefit from it is by using msacc. If you turn it on briefly and when examining the msacc:print() output notice that sleep time of the the thread type poll is low, the system may benefit from increasing the number of polling threads.

Eshell V9.3  (abort with ^G)
1> msacc:start(10000),msacc:print().
Average thread real-time    : 10000410 us
Accumulated system run-time :      937 us
Average scheduler run-time  :      897 us

        Thread      aux check_io emulator       gc    other     port    sleep

Stats per thread:
     async( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
       aux( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
      poll( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
 scheduler( 1)    0.00%    0.00%    0.00%    0.00%    0.01%    0.00%   99.99%

Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
           aux    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_sche    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_sched    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
          poll    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
     scheduler    0.00%    0.00%    0.00%    0.00%    0.01%    0.00%   99.99%

In the example above the poll thread is sleeping for 100% of the time so no need to increase the number of poll threads.

Configure +IOp

+IOp controls the number of pollsets used to put the file descriptors in. This options defaults to 1, and it should be very rare for any system to benefit from changing this. The only time so far that I have seen it to be beneficial is when the kernel-space poll implementation does not scale well when accessed in parallel by multiple threads. So if you run perf top (or something similar) on your system and notice that a lot of time is spent locking the kernel-space pollset, it would be a good idea to increase the number of pollsets used.


Beautiful Code

A bit of poetry… in Erlang

Almost exactly two years ago, I posted part of what follows at the Inakos blog. That blog is now gone and Erlang Solutions will republish most of its articles soon. But this one in particular, I wanted to have it on my own blog.

With my dad on the left, with my kid on the right

I believe that, as a developer, I’m closer to an artist than an engineer. I see software development as a creative endeavour. One that lets you express yourself in many different ways and generate much more than just 1s and 0s.

But, of course, not all programs or systems are artistically pleasant by themselves. That’s why, when the Inakos needed a flyer for a talk at the ECI that represented the idea of beautiful code, I wanted to be sure to write something actually beautiful. That’s why I borrowed the lyrics of one of the most magnificent songs in history and remixed it using my favorite language: Erlang.

It’s a song that means a lot to me, because it meant a lot to my father (who was also a programmer) and I didn’t fully understand it until I had a son of my own. The first time I sung this song to my kid something clicked in and I finally understood what it meant and how true it was. I wrote the original post right after my son’s 3rd birthday and to this day (a few days before his sixth one) I still think it’s one of my most precious pieces of code.

The Original (Updated)

So, here you have it, one of my favorite songs of all time, but now in Erlang:

For those who have actually read the original, you’ll notice I updated it a bit… with things I learned over these years. I hope you don’t mind.

Something New

Since that one was written years ago, let me give you something brand new as well. It’s based again on one of my father’s all-time favorite songs that I couldn’t comprehend entirely until I could see my life reflected on it as a father. This one was originally written in Spanish, but I think you’ll understand it anyway…

Of course… none of these examples work in Erlang, even when they do compile nicely. Don’t forget that poems and lyrics not always strictly work according to the rules of the languages in which they’re written. This is art after all, right? 😉

Beautiful Code was originally published in Erlang Battleground on Medium, where people are continuing the conversation by highlighting and responding to this story.


This Week In Erlang April 13

Welcome to another “This week in Erlang” newsletter.

Articles and Blog posts

Library Updates

Library of the Week

If you still haven’t seen datum, you should take a look immediately. This library by Dmitry (AKA. fogfish) allows you to write clean “pure functional” Erlang programs. Give it a try and “scrap your boilerplate”:



Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.