RPC over RabbitMQ (with Elixir)

At Community, we use RabbitMQ a lot. It’s the infrastructure backbone that allows our services (over forty at this point) to communicate with each other. That communication mostly happens through events, since we have an event-sourced system, but in some cases what we need is a request-response interaction between two services. This is the best tool for a few use cases, like retrieving data on the fly or asking a service to do something and return a response. An industry standard for such interactions is HTTP, but we are not big fans of it. Instead, since RabbitMQ is so ubiquitous in our system, we settled on using it for request-response interactions as well, in the form of Remote Procedure Calls (RPCs). In this post, I’ll go over the architecture of such interactions, the RabbitMQ topologies we use to make them work, the benefits around reliability and the compromises around performance, and finally how this is all implemented to be as fault-tolerant as possible with Elixir.
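
To make the shape of the pattern concrete before diving in, here is a minimal sketch of the client side, assuming the amqp Hex package and RabbitMQ’s direct reply-to feature. The queue name is hypothetical, and all of the supervision, connection handling, and error handling the post covers is left out.

defmodule RpcClient do
  # Hedged sketch of RPC over RabbitMQ; "my_service.rpc_requests"
  # is a hypothetical queue name, not Community's actual topology.
  def call(chan, payload, timeout \\ 5_000) do
    correlation_id = Integer.to_string(System.unique_integer([:positive]))

    # Direct reply-to: consume from the pseudo-queue in no-ack mode
    # *before* publishing, so RabbitMQ routes the response back to us.
    {:ok, _consumer_tag} =
      AMQP.Basic.consume(chan, "amq.rabbitmq.reply-to", nil, no_ack: true)

    :ok =
      AMQP.Basic.publish(chan, "", "my_service.rpc_requests", payload,
        reply_to: "amq.rabbitmq.reply-to",
        correlation_id: correlation_id
      )

    # The server publishes its response with the same correlation_id,
    # which is how we match responses to requests.
    receive do
      {:basic_deliver, response, %{correlation_id: ^correlation_id}} ->
        {:ok, response}
    after
      timeout -> {:error, :timeout}
    end
  end
end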


Erlang: Building a Telnet Chat Server from Scratch Using ZX

A few weeks ago I made a two-part video discussion about building a telnet chat server from scratch using ZX and forgot to post any prose reference to it. (Most people are following my blog RSS, not my (extremely tiny) video channels.)

The resulting project is called “Trash Talk” and it has a repo on GitLab here. It’s MIT licensed and quite simple to hack around on, so have fun.

Part 1

The first video is a bit slower paced than the second. I cover:

  • What ZX templates when you do zx create project and select the “Traditional Erlang Application” project option
  • How everything fits together and works
  • Discussion about why things are structured the way they are in OTP applications
  • A demonstration of a little bit of command implementation in the telnet server (mostly to show where in the code you would do that)

This video is a lot less exciting than the second one because there aren’t a lot of jazzy features demonstrated, but it is probably the one you want to watch once now if you’re new to Erlang, and a second time a few weeks later, once you’ve written a few projects and stubbed your toes a few times (a second viewing brings different insights, not because the video changes but because you change through your experiences).

Part 2

The second video is the more fun one, because the critical question of “Why does anything do anything?” has already been covered in the first. While you might find the first video interesting, it isn’t as exciting as this second one, where you get to see the implementation of features that might be more relevant to problems you have in your real-world projects. In this one I cover:

  • Checking the state of the running system using observer
  • The difference between zx and zxh when running your code in development
  • The “Service ⇒ Worker Pattern” (aka “SWP”) and how ZX can help you drop one in via zx template swp
  • One way to implement text commands in a structured way, with categories of commands indicated by prefixes (this seems elementary and very IRC-ish at first, but in command protocols a direct parallel of this often happens at the byte level — so think on that)
  • How to separate the “clients” concept as a service from the “channels” concept as a service, and discuss how that idea extends to much more complex systems
  • A slightly more useful version of a “service manager” process, in the form of the chan_man
  • And much much more! (advertising people from the 1980’s tell me I should always say that instead of “etc.”)

These aren’t super jazzy, but they get the job done. Hopefully they give some new Erlanger out there a leg-up on the problem of going from a “Hello, World!” escript to a proper OTP application.


Introducing Caramel - An Erlang Backend for the OCaml compiler that provides a fast type-checker for BEAM-based technologies.

Type-checking for the BEAM is one of the most common barriers to adoption. From comments on our LinkedIn posts to threads on HackerNews, it is something that we know can make people nervous. For those in the Erlang and Elixir community, tools like Dialyzer have done a great job of meeting some of the core requirements where type-checking is necessary, but it’s also understandable that they’re not the right tool for every user. As a result, there has been a substantial increase in the number of languages and frameworks that look to address this need. At Code BEAM V, Anton Lavrik announced that Facebook would be releasing their own version of statically-typed Erlang. Our friend Louis Pilfold has also developed Gleam, another fast and performant language that runs on the BEAM VM and offers static typing.

Why Static Typing

Erlang is a dynamically and strongly typed language that is famous for allowing teams to reduce the lines of code needed in their system. As a result, the teams required for many Erlang systems have been small (just think of WhatsApp managing 900 million users with a backend team of only 50 server-side developers). However, as the need for scalable, distributed systems has grown (and with it, the demand for the benefits of BEAM-based technologies), so has the interest in Erlang and Elixir from broader teams with systems built in a mix of technologies. The growing demand for BEAM-based technologies has led to an increased desire to make it easier to catch errors at compile time, before they enter the production system. In theory, this could allow developers to create the ultimate crash-proof systems, with the type checker catching everything before production and the BEAM VM offering all the ‘let it crash’ benefits for handling failure.

Enter Caramel - the OCaml backend for the BEAM

The newest approach to type-checked Erlang has been created by our colleague Leandro Ostera. It began as a hobby project, put together in a matter of weeks to help him understand why it is so hard to offer statically typed Erlang. As more people found out about the project, its popularity rapidly grew, and with good reason. Caramel now offers a highly expressive type system and a blazingly fast type checker. As a result, your team can rule out entire classes of errors by using types that are closer to your domain.

Who is likely to benefit most from Caramel

As you can imagine, a system like this will be highly beneficial for those dealing with complex state machines. Caramel allows for a type-level representation of your business domain logic; this in turn rules out illegal states, making them entirely unrepresentable. Using Caramel, you will have a tool that tells you instantly if you are about to do something which you should not. Caramel will also make fearless refactoring of large codebases significantly easier to achieve.

Why OCaml

Rather than reinvent the wheel when implementing the type-checker for Erlang, Leandro chose to build on the existing OCaml language, which has over 25 years of development behind it. OCaml was originally designed for systems where correctness is critical. As a result, it has an extremely powerful type system that has benefitted from significant investment of time (we’re talking millions of hours) and money (millions of dollars) to produce the right tool for the job. As well as its type checker, Caramel will allow Erlang developers to use OCaml’s powerful module system, which will help with code reuse, enforcing codebase conventions, and large-scale refactors.

Caramel features

The first public release of Caramel in October features:

  • caramelc, the Caramel compiler
  • erlang, an OCaml library for handling Erlang code, including an AST, lexer, parser, and printer

The Caramel compiler

Caramel can compile a large portion of the OCaml language, in particular most of immutable, functional OCaml, into idiomatic Erlang code.

Erlang library for OCaml

The Erlang library currently includes support for the vast majority of the Standard Erlang language, according to the grammar shipped with Erlang/OTP as of version 24. It includes a series of modules to make it easy to construct Erlang programs from OCaml, as well as deconstruct them and operate on them.

What’s next

Given the core components that already exist within Caramel, there is an exciting and ambitious road map ahead. In the future, we could see a situation where any Erlang or OCaml code can be translated and used interchangeably, making the world-class strengths of both languages effortlessly available to anyone who is familiar with either. To learn more about how you can help make that happen, head to Caramel on GitHub. Until then, if you’d like to talk to our expert team about how we can build reliable, scalable solutions to meet your needs, get in touch. We’re always happy to help.



Fintech 2.0 Incumbents vs Challengers - Banking’s Battle Royale

The fundamental building blocks of finance and financial services (FS) are transforming, driven by emerging technologies and changing societal, regulatory, industrial, commercial and economic demands. Fintech, or financial technology, is changing the FS industry via infrastructure-based technology and open APIs. The venue for last week’s World Fintech Forum was filled with fintechs from across the ecosystem and representatives from big banks. The agenda promised two days of talks covering the spectrum of discussion points occupying the industry, and this blog post will look at the top 10 takeaways for those unable to attend the event.

Key takeaways

  1. Great user experience is vital
  2. There’s a lot of regional variation in fintech disruption
  3. More challengers and incumbents are partnering
  4. Big Tech’s strength is in data
  5. Open Banking: regulations proving a boost to fintechs
  6. Blockchain: A dramatic increase in levels of investment
  7. For Cryptocurrencies - the jury’s still out
  8. AI is key
  9. Talent recruitment is one of the biggest challenges
  10. RegTech enables compliance with the changing regulatory landscape

1. The customer really is king

Consumers and business customers alike have embraced the idea of on-demand finance, thanks to mobile and cloud computing. Fintech trends show that people are more comfortable managing their money and business online, and they’re less willing to put up with the comparatively slower and less flexible processes of traditional financial services.

Present on day one of the conference was Laurence Krieger, COO of SME lender Tide, who views customer experience as the key advantage held by challengers. At Tide, they spotted the gap in SME banking where incumbents were simply offering the same type of products as those offered to retail customers but just repackaged as being for SMEs. Meanwhile, Filip Surowiak from ViaCash looked at how they are bridging the gap between cash and digital cash solutions by addressing the movement away from branch and ATM use to a more convenient model. It is these customer-centric approaches which will both win and retain business for fintechs and incumbents alike.

2. The West is moving at a different pace to Asia

While there are over 1,600 UK fintech companies, a figure set to double by 2030 according to the UK Government’s Tech Nation report, it’s emerging economies which lead the global charge in the sector. In the UK, fintech adoption stands at 71 percent, while it’s 87 percent in both China and India.

Chinese fintech ecosystems have scaled and innovated faster than their counterparts in the West. In Asia there are singular platforms, or super apps, which combine FS, entertainment and lifestyle products; as yet, the equivalent does not exist elsewhere.

3. Partnerships are the favoured way to go for incumbents and fintechs

Despite recent announcements such as HSBC’s Kinetic platform and Bo from RBS, building their own digital solutions takes banks significant investment and resources, with no guarantee of success. It also takes significant capital to acquire a successful competing fintech. Strategic investments and partnerships were the approach most favoured by the incumbents present at Fintech Forum.

According to McKinsey’s report, Synergy and Disruption: Ten trends shaping fintech, 80 percent of financial institutions have entered fintech partnerships. Meanwhile, 82 percent expect to increase fintech partnerships in the next three to five years, according to a Department for International Trade report entitled UK FinTech State of the Nation.

At the conference, BNY Mellon, RBS, Citi Ventures and others were all present, positioning their organisations as willing and accessible partners for budding fintech startups. Luis Valdich, MD at Citi Ventures, spoke of their success stories with Pindrop Security and Trulioo and warned us all to avoid being pushed towards a focus on services instead of on the product in collaborations. Udita Banerjee from RBS and Christian Hull from BNY Mellon also described a welcoming environment in which ambitious startups can succeed.

4. Fintech v Techfin - Big Tech’s data riches

During day one’s panel on FS disruption, Stephanie Waismann, CCO at Barclays, outlined how the key advantage for incumbents lies in the data which they hold. This may be true, but when it comes to the amount of data available, Big Tech’s GAFA (Google, Apple, Facebook and Amazon) are unrivalled. They appear poised to grab a substantial slice of the pie by leveraging their deep customer relationships and knowledge to offer financial products.

Joan Cuko, mobile payments analyst from Vodafone, examined Big Tech’s role in the future of FS, classifying them as frenemies, and (again) collaboration was highlighted as being the ‘best path for long-term growth’.

The positioning of fintech challengers closer towards the tech part of what they do indicates the importance they attach to this side of their business. Noam Zeigerson, Chief Data Officer at Tandem Bank, posed the rhetorical question of whether they are a bank or a tech player. Likewise, Tide’s Laurence Krieger sees fintechs as tech players first and financial services providers second.

5. Open Banking’s untapped potential

Open Banking’s enabling technology, APIs, allows non-financial businesses to offer financial services to their clients based on customer banking data, and is recognised as having revolutionary potential.

Sam Overton, Head of Commercial at Bud, discussed the latent power behind Open Banking and how banks and fintechs can approach end consumers to empower them and prove that their data and privacy are not at risk. He emphasised the need for alignment on value issues, where a one-size-fits-all approach currently prevails, and argued that for real traction there needs to be segmentation for individual user groups, such as young parents.

6. Blockchain is here to stay?

DLT or blockchain’s components – such as cryptographic hashes, distributed databases and consensus-building – are not new. When combined, however, they create a powerful new form of data sharing and asset transfer, capable of eliminating intermediaries, central third parties and expensive reconciliation processes.

Monday’s panel included Keith Bear, Fellow of The Cambridge Judge Business School, whose ‘2nd Global Blockchain Benchmarking Study’ found:

The Banking, Financial Markets and Insurance industries are responsible for the largest share of live (enterprise blockchain) networks.

Manu Manchal, Managing Director of ConsenSys, sees the FS landscape as one of price compression and increased competition, with blockchain essential to the ultimate ‘goal of bringing the cost of providing financial products to zero’. Noam Zeigerson from Tandem Bank regards blockchain as the ‘most transformative technology of all time, only surpassed by the internet’. Soren Mortensen, Director of Global Financial Markets at IBM, summed things up nicely when it comes to blockchain in FS: ‘everyone acknowledges that the technology is proven – what’s needed is greater value propositions and use cases’.

7. For Cryptocurrencies - the jury is out

OKCoin’s Head of Europe, Gabor Nguy, set out the case for cryptocurrency’s role within FS. While the benefits of blockchain for improving banks’ processes are accepted, cryptocurrency is still seen somewhat as the unruly upstart within much of the financial family. Gabor argued, with broad agreement, that ‘digital assets are here to stay’ – only this week a UK legal panel formally recognised both digital assets and smart contracts.

Again, when it comes to digital assets, Europe is some way behind Asia in terms of mass adoption. A recent blog post by our Technical Lead, Dominic Perini, discusses the psychology of digital asset ownership and provenance.

8. AI is the technology ‘must-have’ for banks

Whereas for many, blockchain is the technology which holds the biggest potential, the panel on the second day highlighted the role of AI. The global technology research firm Gartner estimates that the blockchain market will be worth roughly $23 billion by 2023; by way of comparison, the estimated business value created by AI is $3.9 trillion for 2022. Jeremy Sosabowski, who works on AI-based risk analysis at Algo Dynamix, sees it as not a nice-to-have, but a ‘must-have for banking’.

The biggest potential for AI lies in middle-office banking, involving compliance and risk. Angelique Assaf spoke about Cedar Rose’s use of AI-powered models for credit risk assessments, with a clear message: ‘Data is our asset and AI is our tool’.

The key consideration for implementing AI was again the question of whether to build or to buy. With 60 percent of AI talent being absorbed into the tech and FS sectors, Nils Mork-Ulnes, Head of Strategy at AIG, was on hand to provide valuable insight into building an AI-centred product team.

Soren Mortensen from IBM framed things as AI not actually existing yet (not until at least 2050); what we currently have is Augmented Intelligence. He identified the key driver for successful AI implementation as data availability: its quality and appropriateness.

Of course, the high-profile fallout involving the Goldman Sachs-backed Apple Credit Card provides a timely reminder of the nascent nature of both cross-industry collaboration and the use of AI in the form of black-box algorithms.

9. The human element is still key to success

Getting specialist skills, both technical and non-technical, is generally recognised as one of the biggest challenges in the fintech space. Laurence Krieger from Tide explained that, in his experience, the very best graduates now want to move into the fintech/startup world. Meanwhile, Stephanie Waismann from Barclays observed that many banks are looking to recruit from outside of the banking and FS industry for a broader range of skills, and are turning to students and universities to help fill gap areas.

10. Regulatory compliance must be at the centre of planning

Emerging international standards have mostly taken the form of high-level principles, leaving national implementation (both regulation and supervision) to diverge considerably across jurisdictions and across different FS sectors. The Financial Conduct Authority (FCA) is a key influence in setting standards of regulation globally.

On hand to advise eager young startups on the importance of regulatory compliance was Ross Paolino, General Counsel at Western Union. Of course, FS is one of the most highly regulated industries, and compliance is becoming increasingly important, whether for startups or large institutions.

Agreed upon by all was the fact that tech simply moves faster than regulators can regulate, and that the gap will always exist to a greater or lesser extent. Ross directed us towards sandboxes as the best option for testing new ideas.

Concluding Thoughts

Similarly to what we have witnessed with publishing, financial services are made up of information rather than physical goods and are therefore seen as one of the industries most vulnerable to disruption by software technology.

Of course, not all fintech startups are out to hurt banks; in fact, many services use legacy platforms to bring them more customers. For incumbents, the costs of digitisation are substantial, and partnering with specialist vendors is the most efficient way to approach implementing changes to each technology layer. Trying to connect the dots between different parts of the community here in the UK, and acting as a portal for organisations worldwide, is Fintech Alliance, who are backed by the Department for International Trade. They aim to provide access to people, firms and information, including connections to investors, policy and regulatory updates, and the ability to attract and hire workers.

In the end, the talks from the stage and the conversations during the networking sessions at the World Fintech Forum lived up to expectations and left an overall impression of positivity around the fintech ecosystem.

To keep up to date with what we are doing in the fintech space, building scalable, fault-tolerant systems, you can subscribe for updates here.



Real-time collaboration with Elixir at Slab

Slab is a knowledge base and team wiki that democratizes knowledge. Jason Chen started Slab in August 2016, after picking Elixir and Phoenix as the best tools to build real-time collaborative applications. The company has grown to 6 engineers since then, distributed worldwide, and relied upon by more than 7000 companies and customers like Asana, Discord, and Glossier. If you are interested in helping companies become a source of learning and purpose, especially during these times where remote collaboration is essential, Slab is hiring.

Slab

Why Elixir?

Slab was not the first time Jason wrote a collaborative web app. He had previous practice building them in Rails and Node.js and he believed there was a lot to improve in the development experience, especially when it came to working with WebSockets. Both technologies were also troublesome in production, as the team faced issues scaling them vertically and horizontally.

I wanted a framework with the same developer experience as Django and Rails, but one that was designed for real-time applications.

— Jason Chen, CEO, on the Phoenix web framework

Jason doesn’t consider himself a person who is always looking for new things, but he knew he would have to survey the options around him when starting Slab. During this period, he explored two main languages: Go and Elixir. In the end, Jason chose Elixir, thanks to the Phoenix web framework: “I was looking for a framework that offered a complete toolset for building web apps. I was not interested in making low-level choices, such as which ORM to use, which library to pick for parsing requests, etc. I wanted a framework with the same developer experience as Django and Rails, but one that was designed for real-time applications”.

Jason gave himself two weeks to build a proof of concept. He wrote a collaborative blog, where multiple users could write a post simultaneously, and comments were added in real-time — all while learning Elixir and the Phoenix framework.

The trial went better than expected, and Jason’s journey with Slab had officially begun.

Growing with the platform

Shortly after, Slab was in a private beta with a handful of companies as users. For each major feature along the way, Elixir and Phoenix provided the building blocks. When they implemented real-time comments, they used Phoenix Channels and Phoenix PubSub. The pattern goes on: “for asynchronous processing, we simply use Elixir tasks”. Later on, to track users editing a document and give each a different cursor color, they used Phoenix Presence, a tool that no other web framework offers out-of-the-box.
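
As a rough illustration of how little code such a building block requires, here is a hedged sketch of real-time comments over Phoenix PubSub; the module and topic names are hypothetical, not Slab’s actual code.

defmodule Comments do
  # Hypothetical sketch; module and topic names are illustrative.
  @pubsub MyApp.PubSub

  # Each connected client process subscribes to the post's topic...
  def subscribe(post_id) do
    Phoenix.PubSub.subscribe(@pubsub, "post:#{post_id}:comments")
  end

  # ...and new comments are broadcast to every subscriber in real time.
  def broadcast(post_id, comment) do
    Phoenix.PubSub.broadcast(@pubsub, "post:#{post_id}:comments", {:new_comment, comment})
  end
end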

Another leap in Jason’s journey with Slab and Elixir was when he had to learn Erlang/OTP, a group of behaviors that ship as part of Erlang’s standard library for building distributed and fault-tolerant applications.

To improve the real-time collaborative editor that is part of Slab, Jason implemented Operational Transformation. The client runs in the browser, implemented with React. As users make changes to the text, their diffs are sent to the server, which arbitrates these updates and synchronizes them across the various clients.

Tackling the synchronization problem is not trivial, especially when the application is running on multiple nodes. Here is the challenge they faced. Imagine user Alice has a WebSocket connection to node X and user Bob is connected to node Y. Both Alice and Bob are working on the same text. How can Slab guarantee that changes from both users are applied, so both see the same document once done editing?

One could try to solve this problem by keeping the server stateless. Every time the server receives a diff from a client, the server would read the document from the database, apply the changes, normalize the result, and broadcast the update to the clients. The issue with this approach is that loading the text from the database on every client update would quickly become expensive, especially as documents grow in size. Response times would increase and the user experience would degrade.

When working with Node.js, Jason tried a different approach. If Alice and Bob were writing to the same document, a load balancer would guarantee that both would be routed to the same node. After trying out both Apache and Nginx, he implemented the balancer in Node.js. The overall solution was time-consuming to get right and introduced operational complexities.

Luckily, these problems are the bread and butter of Erlang/OTP. Jason knew he needed a stateful abstraction to keep this state on the server. He had already heard about the options the platform provides, but he was unsure which one to pick. Jason recalls: “I remember asking the community if I should use an Agent or a GenServer and everyone was really helpful in providing guidance.” They quickly landed on GenServer as their tool of choice.

By default, both GenServers and Agents are local to each node. However, they also support the :global option, which registers a given name across the cluster. Using this option requires the Erlang distribution, which they were already using for Phoenix PubSub and Presence, so this was a straightforward change. This guarantees both Alice and Bob talk to the same GenServer, regardless of whether they joined node X or node Y.
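
In code, the idea is roughly the following; this is a hedged sketch with hypothetical names, not Slab’s actual implementation.

defmodule DocumentServer do
  use GenServer

  # Registering under {:global, name} makes the process reachable from
  # any connected node, so Alice on node X and Bob on node Y reach the
  # same server for the same document.
  def start_link(doc_id) do
    GenServer.start_link(__MODULE__, doc_id, name: {:global, {:document, doc_id}})
  end

  def apply_diff(doc_id, diff) do
    GenServer.call({:global, {:document, doc_id}}, {:apply_diff, diff})
  end

  @impl true
  def init(doc_id), do: {:ok, %{doc_id: doc_id, contents: ""}}

  @impl true
  def handle_call({:apply_diff, _diff}, _from, state) do
    # Arbitrate the OT diff against the in-memory document here (elided)
    # and broadcast the update to the other connected clients.
    {:reply, :ok, state}
  end
end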

Later on, when running the system in production, the platform continued to impress him. Every time they increased the machine resources, they could see the runtime efficiently using everything it had available, without changes to the code.

Learning and tools

There are a few other notable tools in Slab’s stack.

Back in 2017, they migrated to GraphQL powered by Elixir’s Absinthe. There were concerns about adopting the query language, as it was a relatively new technology. Still, they felt it would address a real issue: they had different components in the application needing distinct data, and managing all of these possible combinations was becoming complex. This was one of the main problems GraphQL was designed to solve.
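
For a flavor of what that looks like, a minimal Absinthe schema is sketched below; the types, fields, and resolver are hypothetical, not Slab’s schema.

defmodule MyAppWeb.Schema do
  use Absinthe.Schema

  object :post do
    field :id, :id
    field :title, :string
    field :body, :string
  end

  query do
    # Each UI component queries exactly the fields it needs, instead of
    # the server maintaining one endpoint per combination of data.
    field :post, :post do
      arg :id, non_null(:id)
      # MyApp.Posts.get/1 is a hypothetical context function.
      resolve fn %{id: id}, _resolution -> {:ok, MyApp.Posts.get(id)} end
    end
  end
end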

They are also running on Google Cloud with Kubernetes (K8s), and, like many Elixir engineers, they wondered how the Erlang VM fits into a world with Docker and K8s. Today they run on 6 nodes, 5 of them running application code. The sixth one handles cron jobs and stays on standby for new deployments. They use the peerage library to establish Distributed Erlang connections between the nodes.
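
Their peerage setup is roughly of the following shape; this is a sketch based on the library’s documented DNS strategy, and the service and app names are hypothetical.

# config/prod.exs (hypothetical names): discover sibling pods through
# the headless service's DNS records and connect them over
# Distributed Erlang.
config :peerage,
  via: Peerage.Via.Dns,
  dns_name: "myapp-nodes.default.svc.cluster.local",
  app_name: "myapp"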

We really value Elixir's ability to build complex systems using fewer moving parts. The code is simpler, and the system is easier to operate.

— Sheharyar Naseer, engineer

Overall the Slab team aims to keep the number of dependencies low, something they believe is made possible by the platform and positively impacts onboarding new developers. Sheharyar Naseer, a member of their engineering team, explains: “We really value Elixir’s ability to build complex systems using fewer moving parts. The code is simpler, and the system is easier to operate, making both experienced and new engineers more productive. We ran in production for more than 3 years without resorting to Redis. We just recently added it because we wanted our caches to survive across deployments. Many other stacks impose technologies like Redis from day one.”

This approach also yields benefits when updating libraries. Sheharyar continues: “For the most part, upgrading Erlang, Elixir, and Phoenix is straightforward. We go through the CHANGELOG, which always emphasizes the main changes we need to perform, and we have a pull request ready after one or two hours. The only time we could not upgrade immediately was when Erlang/OTP removed old SSL ciphers, which broke our HTTP client and we caught it early on during development.”

When onboarding engineers, Slab recommends different books and video courses — many of which you can find on our learning resources page — so they have the flexibility to choose a medium they are most productive with. New engineers also work on Slab itself and receive guidance through pull requests. They start with small tasks, usually in the client and GraphQL layers, and slowly tackle more complex problems around the database and Erlang/OTP. If you are interested in improving remote collaboration, learn more about their opportunities on their website.


Further adventures in the JIT

This post continues our adventures in the JIT, digging a bit deeper into the implementation details.

While writing things in machine code (assembler) gives you great freedom it comes at the cost of having to invent almost everything yourself, and there’s no clever compiler to help you catch mistakes. For example, if you call a function in a certain manner and said function doesn’t expect that, you’ll crash your OS process at best or spend hours chasing a heisenbug at worst.

Hence, conventions are always front and center when writing assembler, so we need to visit some of the ones we’ve chosen before moving on.

The most important one concerns registers, and we base it on the system calling convention to make it easier to call C code. I’ve included tables for the SystemV convention used on Linux below. The registers differ on other systems like Windows, but the principle is the same on all of them.

Register | Name | Callee save | Purpose
RDI      | ARG1 | no          | First argument
RSI      | ARG2 | no          |
RDX      | ARG3 | no          |
RCX      | ARG4 | no          |
R8       | ARG5 | no          |
R9       | ARG6 | no          | Sixth argument
RAX      | RET  | no          | Function return value

Thus, if we want to call a C function with two arguments, we move the first into ARG1 and the second into ARG2 before calling it, and we’ll have the result in RET when it returns.

Beyond saying which registers are used to pass arguments, calling conventions also say which registers retain their value over function calls. These are called “callee save” registers, since the called function needs to save and restore them if they’re modified.

In these registers, we keep commonly-used data that rarely (if ever) changes in C code, helping us avoid saving and restoring them whenever we call C code:

Register | Name           | Callee save | Purpose
RBP      | active_code_ix | yes         | Active code index
R13      | c_p            | yes         | Current process
R15      | HTOP           | yes         | Top of the current process’ heap
R14      | FCALLS         | yes         | Reduction counter
RBX      | registers      | yes         | BEAM register structure

We also keep the current process’ stack in RSP, the machine stack pointer, to allow call and ret instructions in Erlang code.

The downside of this is that we can no longer call arbitrary C code as it may assume a much larger stack, requiring us to swap back and forth between a “C stack” and the “Erlang stack”.

In my previous post we called a C function (timeout) without doing any of this, which was a bit of a white lie. It used to be done that way before we changed how the stack works, but it’s still pretty simple as you can see below:

void BeamModuleAssembler::emit_timeout() {
    /* Swap to the C stack. */
    emit_enter_runtime();

    /* Call the `timeout` C function.
     *
     * runtime_call compiles down to a single `call`
     * instruction in optimized builds, and has a few
     * assertions in debug builds to prevent mistakes
     * like forgetting to switch stacks. */
    a.mov(ARG1, c_p);
    runtime_call<1>(timeout);

    /* Swap back to the Erlang stack. */
    emit_leave_runtime();
}

Swapping the stack is very cheap because of a trick we use when setting up registers: by allocating the structure on the C stack we can compute the address of said stack from registers, which avoids having to reserve a precious callee save register and is much faster than having it saved in memory somewhere.

With the conventions out of the way we can start looking at code again. Let’s pick a larger instruction this time, test_heap, which allocates heap memory:

void BeamModuleAssembler::emit_test_heap(const ArgVal &Needed,
                                         const ArgVal &Live) {
    const int words_needed = (Needed.getValue() + S_RESERVED);
    Label after_gc_check = a.newLabel();

    /* Do we have enough free space already? */
    a.lea(ARG2, x86::qword_ptr(HTOP, words_needed * sizeof(Eterm)));
    a.cmp(ARG2, E);
    a.jbe(after_gc_check);

    /* No, we need to GC.
     *
     * Switch to the C stack, and update the process
     * structure with our current stack (E) and heap
     * (HTOP) pointers so the C code can use them. */
    emit_enter_runtime<Update::eStack | Update::eHeap>();

    /* Call the GC, passing how many words we need and
     * how many X registers we use. */
    a.mov(ARG2, imm(words_needed));
    a.mov(ARG4, imm(Live.getValue()));

    a.mov(ARG1, c_p);
    load_x_reg_array(ARG3);
    a.mov(ARG5, FCALLS);
    runtime_call<5>(erts_garbage_collect_nobump);
    a.sub(FCALLS, RET);

    /* Swap back to the Erlang stack, reading the new
     * values for E and HTOP from the process structure. */
    emit_leave_runtime<Update::eStack | Update::eHeap>();

    a.bind(after_gc_check);
}

While this isn’t too complicated it’s still a rather large amount of code: since all instructions are emitted directly into their modules, small inefficiencies like this tend to bloat the modules rather quickly. Beyond using more RAM, this wastes precious instruction cache so we’ve spent a lot of time and effort on reducing code size.

Our most common method of reducing code size is to break out as much of the instruction as possible into a globally shared part. Let’s see how we can apply this technique:

void BeamModuleAssembler::emit_test_heap(const ArgVal &Needed,
                                         const ArgVal &Live) {
    const int words_needed = (Needed.getValue() + S_RESERVED);
    Label after_gc_check = a.newLabel();

    a.lea(ARG2, x86::qword_ptr(HTOP, words_needed * sizeof(Eterm)));
    a.cmp(ARG2, E);
    a.jbe(after_gc_check);

    a.mov(ARG4, imm(Live.getValue()));

    /* Call the global "garbage collect" fragment. */
    fragment_call(ga->get_garbage_collect());

    a.bind(after_gc_check);
}

/* This is the global part of the instruction. Since we
 * know it will only be called from the module code above,
 * we're free to assume that ARG4 is the number of live
 * registers and that ARG2 is (HTOP + bytes needed). */
void BeamGlobalAssembler::emit_garbage_collect() {
    /* Convert ARG2 to "words needed" by subtracting
     * HTOP and dividing it by 8.
     *
     * This saves us from having to explicitly pass
     * "words needed" in the module code above. */
    a.sub(ARG2, HTOP);
    a.shr(ARG2, imm(3));

    emit_enter_runtime<Update::eStack | Update::eHeap>();

    /* ARG2 and ARG4 have already been set earlier. */
    a.mov(ARG1, c_p);
    load_x_reg_array(ARG3);
    a.mov(ARG5, FCALLS);
    runtime_call<5>(erts_garbage_collect_nobump);
    a.sub(FCALLS, RET);

    emit_leave_runtime<Update::eStack | Update::eHeap>();

    a.ret();
}

While we had to write about as much code, the part that is copied into the module is significantly smaller.

In our next post we’ll take a break from implementation details and look at the history behind this JIT.


A first look at the JIT

Now that we’ve had a look at BEAM and the interpreter we’re going to explore one of the most exciting additions in OTP 24: the just-in-time compiler, or “JIT” for short.

If you’re like me the word “JIT” probably makes you think of Hotspot (Java) or V8 (Javascript). These are very impressive pieces of engineering but they seem to have hijacked the term; not all JITs are that sophisticated, nor do they have to be in order to be fast.

We’ve made many attempts at a JIT over the years that aimed for the stars, only to fall short. Our latest and by far most successful attempt went for simplicity instead, trading slight inefficiencies in the generated code for ease of implementation. If we exclude the run-time assembler library we use, asmjit, the entire thing is roughly as big as the interpreter.

I believe much of our success can be attributed to four ideas we had early in the project:

  1. All modules are always compiled to machine code.

    Previous attempts (and HiPE too) had a difficult time switching between the interpreter and machine code: it was either too slow, too difficult to maintain, or both.

    Always running machine code means we never have to switch.

  2. Data may only be kept (passed) in BEAM registers between instructions.

    This may seem silly, aren’t machine registers faster?

    Yes, but in practice not by much and it would make things more complicated. By always passing data in BEAM registers we can use the register allocation given to us by the Erlang compiler, saving us from having to do this very expensive step at runtime.

    More importantly, this minimizes the difference between the interpreter and the JIT from the runtime system’s point of view.

  3. Modules are compiled one instruction at a time.

    One of the most difficult problems in our prior attempts was to strike a good balance between the time it took to compile something and the eagerness to do so. If we’re too eager, we’ll spend too much time compiling, and if we’re too lax we won’t see any improvements.

    This problem was largely self-inflicted and caused by the compiler being too slow (we often used LLVM), which was made worse by us giving it large pieces of code to allow more optimizations.

    By limiting ourselves to compiling one instruction at a time, we leave some performance on the table but greatly improve compilation speed.

  4. Every instruction has a handwritten machine code template.

    This makes compilation extremely fast as we basically just copy-paste the template every time the instruction is used, only performing some minor tweaks depending on its arguments.

    This may seem daunting at first but it’s actually not that bad once you get used to it. While it certainly takes a lot of code to achieve even the smallest of things, it’s inherently simple and easy to follow as long as the code is kept short.

    The downside is that every instruction needs to be implemented for each architecture, but luckily there’s not a lot of popular ones and we hope to support the two most common ones by the time we release OTP 24: x86_64 and AArch64. The others will continue to use the interpreter.

When compiling a module, the JIT goes through the instructions one by one, invoking machine code templates as it goes. This has two very large benefits over the interpreter: there’s no need to jump between instructions because they’re emitted back-to-back (the end of each is the start of the next one), and the arguments don’t need to be resolved at runtime because they’re already “burnt in.”

Now that we have some background, let’s look at the machine code template for our example in the previous post, is_nonempty_list:

/* Arguments are passed as `ArgVal` objects which hold a
 * type and a value, for example saying "X register 4",
 * "the atom 'hello'", "label 57" and so on. */
void BeamModuleAssembler::emit_is_nonempty_list(const ArgVal &Fail,
                                                const ArgVal &Src) {
    /* Figure out which memory address `Src` lives in. */
    x86::Mem list_ptr = getArgRef(Src);

    /* Emit a `test` instruction, which does a non-
     * destructive AND on the memory pointed at by
     * list_ptr, clearing the zero flag if the list is
     * empty. */
    a.test(list_ptr, imm(_TAG_PRIMARY_MASK - TAG_PRIMARY_LIST));

    /* Emit a `jnz` instruction, jumping to the fail label
     * if the zero flag is clear (the list is empty). */
    a.jnz(labels[Fail.getValue()]);

    /* Unlike the interpreter there's no need to jump to
     * the next instruction on success as it immediately
     * follows this one. */
}

This template will generate code that looks almost identical to the template itself. Let’s say our source is “X register 1” and our fail label is 57:

test qword ptr [rbx+8], _TAG_PRIMARY_MASK - TAG_PRIMARY_LIST
jnz label_57

This is much faster than the interpreter, and even a bit more compact than the threaded code, but this is a trivial instruction. What about more complex ones? Let’s have a look at the timeout instruction in the interpreter:

timeout() {
    if (IS_TRACED_FL(c_p, F_TRACE_RECEIVE)) {
        trace_receive(c_p, am_clock_service, am_timeout, NULL);
    }
    if (ERTS_PROC_GET_SAVED_CALLS_BUF(c_p)) {
        save_calls(c_p, &exp_timeout);
    }
    c_p->flags &= ~F_TIMO;
    JOIN_MESSAGE(c_p);
}

That’s bound to be a lot of code, and those macros will be really annoying to convert by hand. How on earth are we going to do this without losing our minds?

By cheating, that’s how :D

static void timeout(Process *c_p) {
    if (IS_TRACED_FL(c_p, F_TRACE_RECEIVE)) {
        trace_receive(c_p, am_clock_service, am_timeout, NULL);
    }
    if (ERTS_PROC_GET_SAVED_CALLS_BUF(c_p)) {
        save_calls(c_p, &exp_timeout);
    }
    c_p->flags &= ~F_TIMO;
    JOIN_MESSAGE(c_p);
}

void BeamModuleAssembler::emit_timeout() {
    /* Set the first C argument to our currently executing
     * process, c_p, and then call the above C function. */
    a.mov(ARG1, c_p);
    a.call(imm(timeout));
}

This little escape hatch saved us from having to write everything in assembler from the start, and many instructions remain like this because there hasn’t been any point to changing them.

That’s all for today. In the next post we’ll walk through our conventions and some of the techniques we’ve used to reduce the code size.


Using Event Sourcing and CQRS with Incident - Part 1


This is the first of a series of posts in which I will present how your application can use Event Sourcing and CQRS for specific domains, with an open-source library that I am developing called Incident.


My first contact with Event Sourcing was back in 2016, when I was working for Raise Marketplace and led a project whose goal was to solve an accounting burden regarding seller payments. As a marketplace, in a nutshell, a buyer pays for something, the company gets a commission, and the seller receives the remaining funds. Tracking the money flow, depending on the options the marketplace offers, can become complex. In that specific case, sellers could opt to combine funds to be paid daily, via different methods such as check, PayPal, ACH, and so on, or decide to request funds one by one. Around all of that, there was a fraud detection process, transfer limits, different categories of sellers, and so on.

Before Event Sourcing, there was no cohesive way to track the steps of each fund, from buyer to seller, across all possible scenarios. And for the accounting people, that quickly becomes a nightmare.

With well-defined commands, events, and the logic associated with them, it became much clearer what information the Accounting department needed at the time.

Later on, I had the opportunity to work on other personal projects using Event Sourcing, so I decided to build something to help me move forward, and from that learning came Incident.

What is Event Sourcing?

Event Sourcing is a design pattern in which the state changes of an entity are stored as a sequence of events. Events are the source of truth and are immutable, and the current state of any entity is derived by replaying all events of the entity in the order they happened.

If you are new to Event Sourcing and CQRS, I highly recommend watching Greg Young's presentation at Code on the Beach 2014 before moving forward, as my intention with this blog post series is not to present Event Sourcing principles and their details per se, but to show how those principles were used in the implementation.

What is Event Sourcing not?

One of the misconceptions that I often see is that if you decide to use Event Sourcing, you should apply it to your entire system, to all your domains. In my opinion this is wrong, an anti-pattern you should avoid at all costs, as it is unlikely that all your domains will suit it.

Event Sourcing is also not new; many industries use "Event Sourcing" concepts without naming them the same way. Accounting keeps track of all account operations, your medical record is about your health history, contracts don't change, they have addendums, and so on.

Incident Main Goals

When I decided to implement a new library, based on my learning through other projects, I had some goals in mind that I wanted to achieve:

  • incentivize the usage of Event Sourcing and CQRS as a great choice for domains that can leverage the main benefits of this design pattern;
  • offer the essential building blocks for using Event Sourcing in your system with proper contracts, but allowing specific needs to leverage what Elixir already brings to the table, for example, concurrency;
  • leverage functions and reducers for executing commands and applying events in the aggregate logic, facilitating stateless tests;
  • be extensible without compromising the main principles;

Events vs Projections

Events are the source of truth in any Event Sourcing domain; they are immutable and are used to calculate the current state of any aggregate (or entity) at any time. All the events of any type, for any aggregate type, are stored in the Event Store in a single table.

The projections are the representation of the current state of an aggregate, and they are very similar to what any system that does not use Event Sourcing has. Your domain will have as many projection tables as you need, but usually you will have one table for each entity type. All projection tables are stored in the Projection Store.

The following diagram helps illustrate the separation between the command model and the query model, as well as their responsibilities.

  1. UI/API issues a Command to attempt to change the state;
  2. Aggregate logic that lives in the Command Model is used (including past events) and if everything is fine, a new Event will be persisted in the Event Store;
  3. The Event Handler will receive the new event and project the new aggregate state into the Projection Store;
  4. UI/API will query the Aggregate Current State from the Query Model;

Aggregate vs Aggregate State

One of the things that makes implementing Event Sourcing harder is trying to manage the aggregate logic, the aggregate state data structure, and the aggregate state logic all in the same place. Incident does it a little differently.

The Aggregate will define how a specific entity (Bank Account, for example) will execute each of its allowed commands and apply each of its allowed events. The aggregate itself only defines the logic but not the current state calculation.

The Aggregate State defines the initial state of an Aggregate and it is able to calculate the current state by replaying all the events through the aggregate logic.

Back in 2013, Greg Young tweeted a functional definition of Event Sourcing, and Incident follows that principle, with the Aggregate logic in a nutshell being:

  • Command > Function > Event;
  • Event and State > Function > New State;

And part of the Aggregate State logic is similar to:

Enum.reduce(events, state, fn event, state ->
  aggregate.apply(event, state)
end)
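
Put together, a bank account aggregate under this split might look roughly like the sketch below. The module and field names are hypothetical, anticipating the commands and events we will define later in this series, and the exact callback signatures Incident expects are covered in upcoming posts.

defmodule Bank.BankAccountAggregate do
  alias Bank.Commands.OpenAccount
  alias Bank.Events.AccountOpened

  # Command > Function > Event: executing a command produces an event.
  def execute(%OpenAccount{account_number: number}) do
    %AccountOpened{aggregate_id: number, account_number: number, version: 1}
  end

  # Event and State > Function > New State: applying an event to a
  # state deterministically produces the next state.
  def apply(%AccountOpened{} = event, state) do
    %{state | account_number: event.account_number, balance: 0, version: event.version}
  end
end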

Let's Get Started

In this series we will be using Incident to implement the Account domain of a Bank system for these main reasons:

  • it is a domain that benefits from the Event Sourcing principles;
  • it contains simple scenarios such as opening an account, depositing funds;
  • it contains complex scenarios such as transferring funds from one account to another;

Other common domains of a typical Bank system, for example Client Profile and Authentication/Authorization, won't be the focus of the Incident implementation, as they are not a good fit, at least not in our case. This is to emphasize the fact that you don't need to use Event Sourcing in your entire system.

Application and Incident Setup

Let's create a new application for our Bank, including the supervision tree. As a side note, I am using Elixir 1.11 in this series, so some of the Elixir configuration details might vary depending on the version you are using.

~> mix new bank --sup

Add Incident to mix.exs, then fetch and compile the dependencies:

defmodule Bank.MixProject do
  use Mix.Project

  # hidden code
  
  def deps do
    [
      {:incident, "~> 0.5.1"}
    ]
  end
end
~> mix do deps.get, deps.compile

Generate the Ecto repos for the Event Store and the Projection Store; this will create the repo modules:

~> mix ecto.gen.repo -r Bank.EventStoreRepo
...
~> mix ecto.gen.repo -r Bank.ProjectionStoreRepo
...

In your application config/config.exs specify the Ecto repos and configure Incident:

# hidden code

config :bank, ecto_repos: [Bank.EventStoreRepo, Bank.ProjectionStoreRepo]

config :incident, :event_store,
  adapter: Incident.EventStore.PostgresAdapter,
  options: [
    repo: Bank.EventStoreRepo
  ]

config :incident, :projection_store,
  adapter: Incident.ProjectionStore.PostgresAdapter,
  options: [
    repo: Bank.ProjectionStoreRepo
  ]

import_config "#{config_env()}.exs"

In your application config/dev|test|prod.exs (the example below defines two separate databases, but it could be the same one), set up the database access for Ecto for each environment:

# config/dev.exs

# hidden code

config :bank, Bank.EventStoreRepo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "bank_event_store_dev"

config :bank, Bank.ProjectionStoreRepo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "bank_projection_store_dev"

Add the Ecto repo modules to the supervision tree in your lib/bank/application.ex:

defmodule Bank.Application do
  @moduledoc false

  use Application

  @impl true
  def start(_type, _args) do
    children = [
      Bank.EventStoreRepo,
      Bank.ProjectionStoreRepo
    ]

    opts = [strategy: :one_for_one, name: Bank.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

Create the database(s), generate the Incident events table migration, and run the migrations:

~> mix ecto.create
...
~> mix incident.postgres.init -r Bank.EventStoreRepo
...
~> mix ecto.migrate
...

The setup is done. It may seem like a lot, but most of it is common setup needed for any application that uses Ecto.

The Bank Account

As we will keep track of bank accounts, we will first define some components that only need to be defined once, and later evolve the aggregate with logic based on the operations we want it to respond to.

Projection

We will need to present some bank account information to the user, so we will define one projection that contains the current state of the bank accounts. Generate an Ecto migration as below; notice the -r flag that specifies which repo the migration belongs to:

~> mix ecto.gen.migration CreateBankAccountsTable -r Bank.ProjectionStoreRepo

Populate the migration with the following fields. Besides the desired fields related to bank account data, every projection should also contain the version, event_id, and event_date fields. These record which event last updated the projection and how many events have been applied to it:

defmodule Bank.ProjectionStoreRepo.Migrations.CreateBankAccountsTable do
  use Ecto.Migration

  def change do
    create table(:bank_accounts) do
      add(:aggregate_id, :string, null: false)
      add(:account_number, :string, null: false)
      add(:balance, :integer, null: false)
      add(:version, :integer, null: false)
      add(:event_id, :binary_id, null: false)
      add(:event_date, :utc_datetime_usec, null: false)

      timestamps(type: :utc_datetime_usec)
    end

    create(index(:bank_accounts, [:aggregate_id]))
  end
end

And the bank account projection schema:

defmodule Bank.Projections.BankAccount do
  use Ecto.Schema
  import Ecto.Changeset

  schema "bank_accounts" do
    field(:aggregate_id, :string)
    field(:account_number, :string)
    field(:balance, :integer)
    field(:version, :integer)
    field(:event_id, :binary_id)
    field(:event_date, :utc_datetime_usec)

    timestamps(type: :utc_datetime_usec)
  end

  @required_fields ~w(aggregate_id account_number balance version event_id event_date)a

  def changeset(struct, params \\ %{}) do
    struct
    |> cast(params, @required_fields)
    |> validate_required(@required_fields)
  end
end

Our First Command and Event

The first operation we will allow is the ability to open a bank account. One of the advantages of Event Sourcing is the use of a ubiquitous language: every operation in the application is named the way the business understands it. Instead of creating a bank account, which is very database-driven language, a bank account is opened. This same language guides how we name commands (intentions) and events (facts), so there is no need for business-to-technical translations, or vice versa.

Let's define an Open Account command. Commands in Incident define the command data and can also validate themselves. You can use Elixir structs or Ecto schemas with changesets, as long as the command implements valid?/1.

defmodule Bank.Commands.OpenAccount do
  @behaviour Incident.Command
  
  use Ecto.Schema
  import Ecto.Changeset

  @primary_key false
  embedded_schema do
    field(:account_number, :string)
  end

  @required_fields ~w(account_number)a

  @impl true
  def valid?(command) do
    data = Map.from_struct(command)

    %__MODULE__{}
    |> cast(data, @required_fields)
    |> validate_required(@required_fields)
    |> Map.get(:valid?)
  end
end
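As a quick check, the command validates itself based on the changeset rules defined above (the output below is what we would expect, assuming the module as defined):

iex> Bank.Commands.OpenAccount.valid?(%Bank.Commands.OpenAccount{account_number: "abc-123"})
true

iex> Bank.Commands.OpenAccount.valid?(%Bank.Commands.OpenAccount{})
false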

Then, let's create an event data structure called Account Opened to hold the data that records the successful operation. Similar to commands, event data can be implemented using Ecto schemas or Elixir structs.

defmodule Bank.Events.AccountOpened do
  use Ecto.Schema

  @primary_key false
  embedded_schema do
    field(:aggregate_id, :string)
    field(:account_number, :string)
    field(:version, :integer)
  end
end

The Bank Account Aggregate and Aggregate State

Any operation in our bank application will happen or not based on some logic that we will define. The place for this logic is the Bank Account aggregate module. We also need to create the Bank Account State, which specifies the initial state of a bank account:

defmodule Bank.BankAccountState do
  use Incident.AggregateState,
    aggregate: Bank.BankAccount,
    initial_state: %{
      aggregate_id: nil,
      account_number: nil,
      balance: nil,
      version: nil,
      updated_at: nil
    }
end
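The generated state module exposes get/1 (used by the aggregate below), which fetches all stored events for an aggregate id and reduces them into the current state. A minimal usage sketch, assuming the module above:

# Returns the initial state if no events exist for this id yet;
# otherwise, returns the state rebuilt by replaying all stored events.
state = Bank.BankAccountState.get("10f60355-9a81-47d0-ab0c-3ebedab0bbf3")
state.account_number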

An Incident aggregate implements two functions:

  • execute/1 receives a command and, based on any logic, returns an event or an error tuple;
  • apply/2 receives an event and a state, and returns a new state.

In our account opening logic, we simply check if the account number already exists; if not, an event is returned, otherwise an error is returned.

defmodule Bank.BankAccount do
  @behaviour Incident.Aggregate

  alias Bank.BankAccountState
  alias Bank.Commands.OpenAccount
  alias Bank.Events.AccountOpened
  
  @impl true
  def execute(%OpenAccount{account_number: account_number}) do
    case BankAccountState.get(account_number) do
      %{account_number: nil} = state ->
        new_event = %AccountOpened{
          aggregate_id: account_number,
          account_number: account_number,
          version: 1
        }

        {:ok, new_event, state}

      _state ->
        {:error, :account_already_opened}
    end
  end
  
  @impl true
  def apply(%{event_type: "AccountOpened"} = event, state) do
    %{
      state
      | aggregate_id: event.aggregate_id,
        account_number: event.event_data["account_number"],
        balance: 0,
        version: event.version,
        updated_at: event.event_date
    }
  end
end

The Event Handler

The Event Handler is the connection between the command side and the query side of the domain. Once an event happens on the command side, you can decide what to do on the query side; usually, the aggregate projection is updated with the new data so the UI can read it.

An Incident event handler implements listen/2, which pattern matches on the event type. In the AccountOpened example, we ask the aggregate to return the new state and, based on that, build the data to be projected. Depending on the event type, you can also perform other side effects; we will see that in the following posts in this series.

defmodule Bank.BankAccountEventHandler do
  @behaviour Incident.EventHandler

  alias Bank.Projections.BankAccount
  alias Bank.BankAccount, as: Aggregate
  alias Incident.ProjectionStore

  @impl true
  def listen(%{event_type: "AccountOpened"} = event, state) do
    new_state = Aggregate.apply(event, state)

    data = %{
      aggregate_id: new_state.aggregate_id,
      account_number: new_state.account_number,
      balance: new_state.balance,
      version: event.version,
      event_id: event.event_id,
      event_date: event.event_date
    }

    ProjectionStore.project(BankAccount, data)
  end
end

The Command Handler

There is only one missing piece: the entry point. The Command Handler is responsible for that; each command handler specifies which aggregate and event handler should be used.

defmodule Bank.BankAccountCommandHandler do
  use Incident.CommandHandler,
    aggregate: Bank.BankAccount,
    event_handler: Bank.BankAccountEventHandler
end

With that in place, you can start issuing commands via Bank.BankAccountCommandHandler.receive/1:

# Create a command to open an account
iex(1)> command_open = %Bank.Commands.OpenAccount{account_number: Ecto.UUID.generate()}
%Bank.Commands.OpenAccount{account_number: "10f60355-9a81-47d0-ab0c-3ebedab0bbf3"}

# Successful command for opening an account
iex(2)> Bank.BankAccountCommandHandler.receive(command_open)
{:ok,
 %Incident.EventStore.PostgresEvent{
   __meta__: #Ecto.Schema.Metadata<:loaded, "events">,
   aggregate_id: "10f60355-9a81-47d0-ab0c-3ebedab0bbf3",
   event_data: %{
     "account_number" => "10f60355-9a81-47d0-ab0c-3ebedab0bbf3",
     "aggregate_id" => "10f60355-9a81-47d0-ab0c-3ebedab0bbf3",
     "version" => 1
   },
   event_date: #DateTime<2020-10-31 23:17:30.074416Z>,
   event_id: "072fcfce-9521-4432-a2e0-517659590556",
   event_type: "AccountOpened",
   id: 5,
   inserted_at: #DateTime<2020-10-31 23:17:30.087480Z>,
   updated_at: #DateTime<2020-10-31 23:17:30.087480Z>,
   version: 1
}}
 
# Failed command as the account number already exists
iex(3)> Bank.BankAccountCommandHandler.receive(command_open)
{:error, :account_already_opened}

# Fetching a specific bank account from the Projection Store based on its aggregate id
iex(4)> Incident.ProjectionStore.get(Bank.Projections.BankAccount, "10f60355-9a81-47d0-ab0c-3ebedab0bbf3")
%Bank.Projections.BankAccount{
  __meta__: #Ecto.Schema.Metadata<:loaded, "bank_accounts">,
  account_number: "10f60355-9a81-47d0-ab0c-3ebedab0bbf3",
  aggregate_id: "10f60355-9a81-47d0-ab0c-3ebedab0bbf3",
  balance: 0,
  event_date: #DateTime<2020-10-31 23:17:30.074416Z>,
  event_id: "072fcfce-9521-4432-a2e0-517659590556",
  id: 2,
  inserted_at: #DateTime<2020-10-31 23:17:30.153274Z>,
  updated_at: #DateTime<2020-10-31 23:17:30.153274Z>,
  version: 1
}

# Fetching all events for a specific aggregate id
iex(5)> Incident.EventStore.get("10f60355-9a81-47d0-ab0c-3ebedab0bbf3")
[
  %Incident.EventStore.PostgresEvent{
    __meta__: #Ecto.Schema.Metadata<:loaded, "events">,
    aggregate_id: "10f60355-9a81-47d0-ab0c-3ebedab0bbf3",
    event_data: %{
      "account_number" => "10f60355-9a81-47d0-ab0c-3ebedab0bbf3",
      "aggregate_id" => "10f60355-9a81-47d0-ab0c-3ebedab0bbf3",
      "version" => 1
    },
    event_date: #DateTime<2020-10-31 23:17:30.074416Z>,
    event_id: "072fcfce-9521-4432-a2e0-517659590556",
    event_type: "AccountOpened",
    id: 5,
    inserted_at: #DateTime<2020-10-31 23:17:30.087480Z>,
    updated_at: #DateTime<2020-10-31 23:17:30.087480Z>,
    version: 1
  }
]

Recap

With Incident you can implement Event Sourcing in specific domains of your application. The library takes care of the foundational pieces internally, guiding you through the Event Sourcing aspects and components while giving you total control and flexibility over your application logic and processing. With all the good things present in Elixir, you can decide, for example, what level of concurrency you want, if any; Incident will not force you.

The idea behind Incident is to be a framework for the Event Sourcing pieces, relying on widely adopted persistence solutions like Ecto, while allowing your application to use whatever you prefer for everything else.

Lots of the code in this first post of the series was setup that is common to most Elixir applications that use Postgres, plus some one-time Incident configuration.

What Comes Next?

The next posts will cover other use cases in the bank application, and the majority of the code will be the logic that defines the next operations, some of them simple, others complex.

Check the Incident GitHub repo for the complete bank application example, including lots of integration tests. Feel free to share it and contribute; the library is somewhat new, and any feedback is welcome. See you soon!

Permalink

Delivering social change with Elixir at Change.org

Welcome to our series of case studies about companies using Elixir in production. See all cases we have published so far.

Change.org is a social change platform with over 400 million users worldwide. Two years ago, their engineering team faced the challenge of migrating their messaging system from an external vendor to an in-house solution, to reduce costs and gain flexibility.

This article will discuss how they approached this problem, why they chose Elixir, and how their system grew to deliver more than 1 billion emails per month. Change.org is also hiring Elixir engineers to join their team.


The path to Elixir

The first step for Change.org’s engineering team was to outline the requirements for their system. The system would receive millions of events, such as campaign updates, new petitions, and more, and it should send emails to all interested parties whenever appropriate. They were looking for an event-driven solution at its core, in which concurrency and fault-tolerance were strong requirements.

The next stage was to build proofs-of-concept in different programming languages. Not many companies can afford this step, but Change.org’s team knew the new system was vital to their business and wanted to be thorough in their analysis.

Around this time, John Mertens, Director of Engineering, was coming back from parental leave. He used this opportunity to catch up on different technologies whenever possible. That's when he stumbled upon José Valim's presentation at Lambda Days, which discussed two libraries in the Elixir ecosystem: GenStage and Flow.

They developed prototypes in four technologies: JRuby, Akka Streams, Node.js, and Elixir. The goal was to evaluate performance, developer experience, and community support for their specific use cases. Each technology had to process 100k messages as fast as possible. John was responsible for the Elixir implementation and put his newly acquired knowledge to use.

After two evaluation rounds, the team chose to go ahead with Elixir. Their team of 3 engineers had 18 months to replace the stack they had been using for the last several years with their own Elixir implementation.

Learning Elixir

When they started the project, none of the original team members had prior experience with Elixir. Only Justin Almeida, who joined when the project had been running for six months, had used Elixir before.

Luckily, the team felt supported by the different resources available in the community. John recalls: “We were in one of our early meetings discussing how to introduce Elixir into our stack when Pragmatic Programmers announced the Adopting Elixir book, which was extremely helpful in answering many of our questions.”

The new system

The team developed three Elixir applications to replace the external vendor. The first application processes all incoming events to decide whether an email should go out and to whom.

The next application is the one effectively responsible for dispatching the emails. For each message, it finds the appropriate template as well as the user locale and preferences. It then assembles the email and delivers it with the help of a Mail Transfer Agent (MTA).

The last application is responsible for analytics. It receives webhook calls from the MTA with batches of different events, which are processed and funneled into their data warehouse for later use.

After about four months, they put the new system in production. While Change.org has dozens of different email templates, the initial deployment handled a single, straightforward case: password recovery.

Once the new system was in production, they continued to migrate different use cases to it, increasing the number of handled events and delivered emails day after day. After one year, they had completed the migration ahead of schedule.

Handling spikes and load regulation

Today, those applications run on a relatively small number of nodes. The first two applications use 6 to 8 nodes, while the last one uses only two nodes.

John explains they are over-provisioned because spikes are relatively frequent in the system: “for large campaigns, a single event may fan out to thousands or hundreds of thousands of emails.”

The team was kind enough to share some of their internal graphs. In the example below, you can see a spike of over 10 million messages coming to the system:

Usage at Change.org

Once this burst happens, all nodes max out their CPUs, emitting around 3,000 emails per second until they drain the message queue. The whole time, memory usage remains at 5%.

The back-pressure provided by the GenStage library played a crucial role in the system's performance. Since those applications fetch events from message queues, process them, and submit them to third-party services, they must avoid overloading any part of the stack. GenStage addresses this by allowing the different components, called stages in the library terminology, to communicate how much data they can handle right now. For example, if sending messages to the MTA is slower than usual, the system will naturally get fewer events from the queue.
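To make the demand mechanism concrete, here is a minimal GenStage sketch (illustrative module names, not Change.org's actual code): the consumer only asks for a bounded number of events, so a slow consumer automatically slows the producer down.

defmodule Counter do
  use GenStage

  # A producer that emits an increasing sequence of integers,
  # but only as many as consumers have asked for.
  def init(initial), do: {:producer, initial}

  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Printer do
  use GenStage

  # A consumer that subscribes with a bounded demand window.
  def init(:ok), do: {:consumer, :ok, subscribe_to: [{Counter, max_demand: 10}]}

  def handle_events(events, _from, state) do
    # Simulate a slow downstream call (e.g. an MTA): while we sleep,
    # no new demand flows upstream, so the producer waits.
    Process.sleep(100)
    IO.inspect(events)
    {:noreply, [], state}
  end
end

{:ok, _} = GenStage.start_link(Counter, 0, name: Counter)
{:ok, _} = GenStage.start_link(Printer, :ok)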

Another essential feature of the system is working in batches. Receiving and sending data is more efficient and cost-effective if you can do it in groups instead of one by one. John has given a presentation at ElixirConf Europe sharing the lessons learned from their first trillion messages.

The activity on Change.org has grown considerably over the last year too. The systems have coped just fine. Justin remarks: “everything has been working so well that some of those services are not really on our minds.”

Working with the ecosystem

Change.org has relied on and contributed to the ecosystem whenever possible. During the migration, both old and new systems had to access many shared resources, such as HAML templates, Ruby’s I18N configuration files, and Sidekiq’s background queues. Fortunately, they were able to find compatible libraries in the Elixir ecosystem, respectively calliope, linguist, and exq.

Nowadays, some of those libraries have fallen out of favor. For example, the community has chosen gettext for internationalization, as it is a more widely accepted format. For this reason, Change.org has stepped in and taken ownership of the linguist library.

As Change.org adopted Elixir, the ecosystem grew to better support their use cases too. One recent example is the Broadway library, which makes it easy to assemble data pipelines. John explains: “Broadway builds on top of GenStage, so it provides the load regulation, concurrency, and fault-tolerance that we need. It also provides batching and partitioning, which we originally had to build ourselves. For new projects, Broadway is our first choice for data ingestion and data processing.”
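As an illustration, a Broadway pipeline skeleton looks roughly like this (a sketch with made-up names; it assumes the separate broadway_sqs package for the SQS producer):

defmodule EmailEventsPipeline do
  use Broadway

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        # Assumes the broadway_sqs package; any Broadway producer works here.
        module: {BroadwaySQS.Producer, queue_url: "https://sqs.example.com/email-events"},
        concurrency: 2
      ],
      processors: [
        default: [concurrency: 50]
      ],
      batchers: [
        # Batching comes built in: group up to 100 messages or wait 1s.
        default: [batch_size: 100, batch_timeout: 1_000]
      ]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    # Per-message work (decode the event, decide recipients, etc.) goes here.
    message
  end

  @impl true
  def handle_batch(_batcher, messages, _batch_info, _context) do
    # Batch work (bulk insert, bulk submit to the MTA, etc.) goes here.
    messages
  end
end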

Elixir as the default stack

As projects migrated to Elixir, it has informally become the default stack at Change.org for backend services. Today they have more than twenty projects. The engineering team has also converged on a common pattern for services in their event-driven architecture, built with Broadway and Phoenix.

In a nutshell, they use Broadway to ingest, aggregate, and store events in the database. Then they use Phoenix to expose this data, either through APIs, as analytics, or as tooling for their internal teams.
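In sketch form, the Phoenix side of that pattern is a thin read layer over whatever the pipeline stored (hypothetical module names, purely for illustration):

defmodule EventsWeb.StatsController do
  # Assumes a standard Phoenix project with an EventsWeb module;
  # Events.Analytics.stats_for/1 is a made-up context function.
  use EventsWeb, :controller

  def show(conn, %{"id" => id}) do
    json(conn, Events.Analytics.stats_for(id))
  end
end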

One recent example is Change.org’s Bandit service. The service provides a Phoenix API that decides which copy to present to users in various parts of their product. As users interact with these copies, data is fed into the system and analyzed in batches with Broadway. They use this feedback to optimize and make better choices in the future.

The team has also grown to ten Elixir developers, thanks to the multiple training sessions and communities of practice they have organized internally. Change.org is also looking for Elixir backend engineers, as they aim to bring experience and diversity to their group. Interested developers can learn more about these opportunities on their website.

Permalink
