OTP 24.0 Release Candidate 1

This is the first of three planned release candidates before the OTP 24 release.
The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you.

We encourage users to try it out and give us feedback either by creating an issue here https://github.com/erlang/otp/issues
or by posting to the mailing list erlang-questions@erlang.org.

Erlang/OTP 24 is a new major release with new features and improvements, as well as a few incompatibilities. Some of the new
features are highlighted below.

Highlights

erts, kernel, stdlib

  • The BeamAsm JIT-compiler has been added to Erlang/OTP and will give a significant performance boost for many applications.
    The JIT-compiler is enabled by default on most x86 64-bit platforms that have a C++ compiler that can compile C++17.
    To verify that a JIT enabled emulator is running you can use erlang:system_info(emu_flavor).

  • A compatibility adaptor for gen_tcp to use the new socket API has been implemented (gen_tcp_socket).

  • Extended error information for failing BIF calls as proposed in EEP 54 has been implemented.

  • Process aliases as outlined by EEP 53 have been introduced.
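
As a quick illustration of the JIT check mentioned in the first item above, here is a minimal sketch in Elixir. The module name JitCheck is made up for this example. It assumes Erlang/OTP 24 or later, since older releases do not recognise the emu_flavor item; the call is documented to return either jit or emu.

```elixir
# Assumption: running on Erlang/OTP 24 or later.
# JitCheck is a hypothetical helper module, not part of OTP.
defmodule JitCheck do
  # Returns true when the running emulator is the JIT flavor.
  def jit_enabled? do
    :erlang.system_info(:emu_flavor) == :jit
  end
end

IO.puts("emu_flavor: #{:erlang.system_info(:emu_flavor)}")
IO.puts("JIT enabled? #{JitCheck.jit_enabled?()}")
```

From an Erlang shell the equivalent call is the one given in the release notes, erlang:system_info(emu_flavor).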

compiler

  • Compiler warnings and errors now include column numbers in addition to line numbers.
  • Variables bound between the keywords 'try' and 'of' can now be used in the clauses following the 'of' keyword
    (that is, in the success case when no exception was raised).

ftp

  • Added support for FTPES (explicit FTP over TLS).

ssl

  • Support for the "early data" feature for TLS 1.3 servers and clients.
  • Make TLS handshakes in Erlang distribution concurrent.

wx

  • The application has been completely rewritten in order
    to use wxWidgets version 3 as its base.
  • Added support for wxWebView.

edoc

  • EDoc is now capable of emitting EEP-48 doc chunks. This means that, with some configuration, community projects
    can now provide documentation for shell_docs the same way that OTP libraries have done since OTP 23.0.

For more details about new features and potential incompatibilities see
https://erlang.org/download/OTP-24.0-rc1.README

Pre-built versions for Windows can be fetched here:
http://erlang.org/download/otp_win32_24.0-rc1.exe
http://erlang.org/download/otp_win64_24.0-rc1.exe

Online documentation can be browsed here:
http://erlang.org/documentation/doc-12.0-rc1/doc/

The Erlang/OTP source can also be found at GitHub on the official Erlang repository,
https://github.com/erlang/otp

Social messaging with Elixir at Community

Community is a platform that enables instant and direct communication with the people you want to reach, using the simplicity of text messaging. Used by names like Paul McCartney, Metallica, and Barack Obama, Community connects small businesses, stars, and high-profile individuals directly to their audiences.

Community is powered by the Erlang Ecosystem, with Elixir and RabbitMQ playing central roles. This article gives an overview of the system and the tools used to handle spikes of millions of users caused by events such as a tweet from a high-profile user.

The first steps with Elixir

Tomas Koci and Ustin Zarubin were the two engineers behind Community’s initial implementation. The company was pivoting from a product they had written in Go and they felt the language was not expressive enough for the products they were building. So when faced with the challenge of developing a social messaging platform on top of SMS, they were open to trying a different stack.

Their first encounter with Elixir was a casual one. They were chatting about the challenges ahead of them when their roommate mentioned Elixir. Shortly after, things started to click. They both had a physics background, so they found the functional paradigm quite intuitive. The Erlang VM also has its origins in telecommunications, and they were building a telecom-centric product, which gave them more confidence.

Besides the technological aspect, they also began to be active in the Elixir community. Tomas recaps: “we started attending the Elixir meetups happening here in Chattanooga. We met many developers, heard about production cases, and learned how companies like Bleacher Report were using Elixir at scale”. From then on, they were sold on giving Elixir a try.

They started their prototype in January 2018, with the intent of onboarding dozens of users. They were learning Elixir while developing the system and reaching out to potential users.

Their first challenge was in May 2018, when one of their users announced his phone number, managed by Community, to millions of viewers. Tomas still remembers that day: “It was a Saturday night, around 11:00 pm when we saw an influx of users. It caught us by surprise and, after 10 hours, more than 400 thousand users had signed up”. This influx of users stressed the system in unexpected ways, especially when it came to their upstream integrations. They had to patch the system to ensure they would not overload external systems or run over API limits they were required to conform to.

This event also gave them insights into the types of spikes and traffic patterns the system would have to handle at scale. Early engineering hire Jeffrey Matthias urged them to break their application into different services, making it easy to scale each service individually, and he and Tomas decided to have those services communicate via message queues.

The next millions of users

By October 2018, the company had received funding and the newly hired engineering team of five people began to split the original application into services that could handle sharp increases in demand and operate at scale. Shortly after, they had their next challenge in hand: Metallica had just signed up with the platform and they were going to do an early launch with their fans on Feb 1st, 2019.

The team is glad to report the announcement was a success with no hiccups on their end. They were then five backend engineers who tackled everything from architectural design and development to setting up and operating the whole infrastructure.

Community was officially unveiled in May 2019, attracting hundreds of music stars shortly after. Fourteen months later, Barack Obama tweeted his phone number, powered by Community, to millions.

The current architecture

Today, more than 60 services with distinct responsibilities power Community, such as:

  • A message hub between community leaders and members
  • User data management
  • Media services (video, audio, images)
  • Systems for Community’s internal team
  • Data science and machine learning
  • Billing, administration, etc

The vast majority of those services run Elixir, with Python covering the data science and machine learning endpoints, and Go on the infrastructure side.

RabbitMQ handles the communication between services. The Erlang-backed message queue is responsible for broadcasting messages and acting as their RPC backbone. Messages between services are encoded with Protocol Buffers via the protobuf-elixir library.

Initially, they used the GenStage library to interface with RabbitMQ, but they have migrated to the higher level Broadway library over the last year. Andrea Leopardi, one of their engineers, outlines their challenges: “Our system has to handle different traffic patterns when receiving and delivering data. Incoming data may arrive at any time and be prone to spikes caused by specific events powered by actions within Communities. On the other hand, we deliver SMSes in coordination with partners who impose different restrictions on volumes, rate limiting, etc.”

He continues: “both GenStage and Broadway have been essential in providing abstractions to handle these requirements. They provide back-pressure, ensure that spikes never overload the system, and guarantee we never send more messages than the amount defined by our delivery partners”. As they implemented the same patterns over and over in different services, they found Broadway to provide the ideal abstraction level for them.
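
The core idea behind this kind of back-pressure can be sketched in plain Elixir: the producer never pushes data on its own and only hands out as many events as the consumer asks for. This is a simplified illustration under stated assumptions, not the GenStage or Broadway API and not Community's code; the DemandProducer module and its message shapes are invented for this example.

```elixir
# Hypothetical, simplified demand-driven producer (not GenStage/Broadway).
defmodule DemandProducer do
  # Starts a process that holds the pending events.
  def start(events), do: spawn_link(fn -> loop(events) end)

  # Called by the consumer: asks for at most `count` events and waits for them.
  def ask(producer, count) do
    send(producer, {:demand, count, self()})

    receive do
      {:events, ^producer, batch} -> batch
    after
      1_000 -> raise "producer did not answer"
    end
  end

  # The producer only answers explicit demand, so a slow consumer
  # is never handed more than it requested.
  defp loop(events) do
    receive do
      {:demand, count, from} ->
        {batch, rest} = Enum.split(events, count)
        send(from, {:events, self(), batch})
        loop(rest)
    end
  end
end

producer = DemandProducer.start(Enum.to_list(1..5))
IO.inspect(DemandProducer.ask(producer, 2))
IO.inspect(DemandProducer.ask(producer, 2))
IO.inspect(DemandProducer.ask(producer, 2))
```

Because the consumer decides how much to request and when, a cap imposed by a delivery partner can be respected by simply never asking for more than the allowed amount per interval.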

Their most in-demand service, the message hub, is powered by only five machines. They use Apache Mesos to coordinate deployments.

Growing the team

Community’s engineering team has seen stable growth over the last two years. Today they are 25 backend engineers, the majority being Elixir devs, and the company extends beyond 120 employees.

Karl Matthias, who joined early on, believes the challenges they face and the excitement of working with a new language have been positive for hiring talent. He details: “we try to hire the best production engineers we can, sometimes they know Elixir, sometimes they don’t. Our team has generally seen learning Elixir as a positive and exciting experience”.

The team is also happy and confident about the stability Elixir provides. Karl adds: “Elixir supervisors have our back every time something goes wrong. They automatically reestablish connections to RabbitMQ, they handle dropped database connections, etc. The system has never gone wrong to the point our infrastructure layer had to kick-in, which has been quite refreshing.”
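
The restart behaviour Karl describes can be sketched with the standard library alone. In this minimal example a supervisor watches a single worker (an Agent registered under the made-up name :counter); when the worker is killed, the supervisor starts a fresh one with its initial state. The RestartDemo helper is hypothetical and only exists to wait for the restart.

```elixir
# Hypothetical helper that waits until a new process is registered under `name`.
defmodule RestartDemo do
  def await_restart(name, old_pid, attempts \\ 100) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ when attempts > 0 ->
        Process.sleep(10)
        await_restart(name, old_pid, attempts - 1)

      _ ->
        raise "worker #{inspect(name)} was not restarted"
    end
  end
end

# One supervised worker: an Agent holding a counter that starts at 0.
children = [
  %{id: :counter, start: {Agent, :start_link, [fn -> 0 end, [name: :counter]]}}
]

{:ok, _supervisor} = Supervisor.start_link(children, strategy: :one_for_one)

old_pid = Process.whereis(:counter)
Agent.update(:counter, fn count -> count + 1 end)

# Simulate a crash; the supervisor notices and starts a replacement.
Process.exit(old_pid, :kill)
new_pid = RestartDemo.await_restart(:counter, old_pid)

IO.inspect(new_pid != old_pid)
# The replacement starts from the initial state again.
IO.inspect(Agent.get(:counter, fn count -> count end))
```

A real service would supervise connection processes (for example to a message broker or a database) in the same way, so a dropped connection results in a clean restart rather than a broken system.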

The Community team ended our conversation with a curious remark. They had just shut down their first implementation of the system, the one that received a sudden spike of four hundred thousand users on a Saturday night. Tomas concludes: “it is pretty amazing that the service we implemented while learning Elixir has been running and operating in production just fine, even after all of these milestones. And that’s generally true for all of our services: once deployed, we can mostly forget about them”.

How to ensure your Instant Messaging solution offers users privacy and security.

Concerns around privacy and security have become a big talking point this year. A number of major Instant Messaging providers have been criticised for the way that their apps collect, store and share the information of their users. Among the mountains of data collected, most apps will receive information about users’ interests, behavioural patterns and location.

Users’ privacy concerns have caused an unexpected and unwanted complication for many companies using enterprise versions of these chat applications. Firstly, if customers are turning away from the specific chat provider a business has chosen to adopt, it can create a barrier between the company and its customers. Secondly, for businesses in regulated industries such as FinTech or Healthcare, a chat provider’s ability to change its privacy and security terms and conditions is an unacceptable risk.

The best way for a business to be sure that its enterprise instant messaging solution is secure and meets the privacy demands of its users is to have control of the privacy and security settings of the chat application. With most out-of-the-box solutions, control over privacy and security is rigid and dictated by the chat provider.

Privacy by default, customisable by design.

MongooseIM is built with maximum user privacy levels as the default, which means anyone using a standard implementation of MongooseIM has an extremely private, secure chat server that has been approved by some of the most stringent regulatory boards. On top of that, you control your chat application, giving you the ability to tailor the settings to suit your company’s needs and your users. Below is a list of privacy considerations built into MongooseIM.

Minimise and limit

The minimise and limit principle regards the amount of personal data gathered by a service. The general principle here is to take only the bare minimum required for a service to run instead of saving unnecessary data just in case. If more data is collected, the unnecessary part should be deleted. Luckily, MongooseIM is using only the bare minimum of personal data provided by the users and relies on the users themselves to provide more if they wish to - e.g. by filling out the roster information. Moreover, since it is implementing XMPP and is open source, everybody has insight into how the data is processed.

Hide and protect

The hide and protect principle refers to the fact that user data should not be made public and should be hidden from plain view, preventing third parties from identifying users through personal data or its interrelation. We have tackled that by handling the creation of JIDs and having recommendations regarding log collection and archiving. What is this all about? See, JIDs are the central and focal point of MongooseIM operation as they are unique user identifiers in the system. As long as the JID does not contain any personally identifiable information like a name or a telephone number, the JID is far more than pseudo-anonymous and cannot be linked to the individual it represents. This is why one should refrain from putting personally identifiable information in JIDs. For that reason, our release includes a mechanism that allows automatic user creation with random JIDs that you can invoke by typing ‘register’ in the console. Specific JIDs are created by intentionally invoking a different command (register_identified).
Still, it is possible that MongooseIM logs contain personally identifiable information such as IP addresses that could correlate to JIDs. Even though the JID is anonymous, an IP address next to a JID might lead to the person behind it through correlation. That is why we recommend that installations with privacy in mind have their log level set to at least ‘warning’ level in order to avoid breaches of privacy while still maintaining log usability.
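
To illustrate what a random JID without personal data could look like, here is a minimal sketch in Elixir. It is not MongooseIM's implementation of the register command; the RandomJid module, the length of the random part and the example domain are all assumptions made for this illustration.

```elixir
# Hypothetical sketch; not how MongooseIM generates JIDs internally.
defmodule RandomJid do
  # Builds a JID whose local part is random hex, so it carries no
  # name, phone number or other personally identifiable information.
  def generate(domain) do
    local = :crypto.strong_rand_bytes(16) |> Base.encode16(case: :lower)
    local <> "@" <> domain
  end
end

# "chat.example.com" is a placeholder domain.
IO.puts(RandomJid.generate("chat.example.com"))
```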

Separate and aggregate

The separate principle boils down to partitioning user data into chunks rather than keeping them in a monolithic DB. Each chunk should contain only the necessary private data for its own functioning. Such a separation creates issues when trying to identify a person through correlation as the data is scattered and isolated - hence the popularity of microservices. Since MongooseIM is an XMPP server written in Erlang, it is naturally partitioned into modules that have their own storage backends. In this way, private data is separated by default in MongooseIM and can also be handled individually - e.g. by deleting all the private data relating to one function.
The aggregation principle refers to the fact that all data should be processed in an aggregated manner and not in one focused on detailed personal cases. For instance, behavioural patterns should be representative of a concrete, not identifiable cohort rather than of a certain Rick Sanchez or Morty Smith. All the usage data being processed by MongooseIM is devoid of any personally identifiable traits and instead tracks metrics relevant to the health of the server. The same can be said for WombatOAM if you pair it with MongooseIM. Therefore, aggregation is supported by default.

Privacy by default

It is assumed that the user should be offered the highest degree of privacy by default. This is highly dependent on your own implementation of the service running on top of MongooseIM. However, if you follow our recommendations laid out in this post, you can be sure you implement it well on the backend side, as we do not differentiate between the levels of privacy being offered.

The Right of Access

As users’ privacy concerns grow, so too does the likelihood that a user will request to see the data that your chat application has stored on them. With MongooseIM we have put a lot of effort into making the retrieval as painless as possible for system administrators that oversee the day to day operations. That is why we have developed a mechanism you can start by executing the retrieve_personal_data command in order to collect all the personal and derivative data belonging to a user behind a specific JID. The command will execute for all the modules no matter if they are enabled or disabled. Then, all the relevant data is extracted per module and is returned to the user in the form of an archive.
In order to facilitate the data collection, we have changed the schemas for all of our MAM backends. This has been done to allow a swift data extraction since up till now it was very inefficient and resource hungry to run such a query. Of course, we have prepared migration strategies for the affected backends.

The Right to be Forgotten

The right to be forgotten is another one that goes alongside the right of access. Each user has the right to remove their footprint from the service. Since we know retrieval from all the modules listed above is problematic, removal is even worse.
We have implemented a mechanism that removes the user account leaving behind only the JID. You can run it by executing the “unregister” command. All of the private data not shared with other users is deleted from the system. In contrast, all of the private data that is shared with other users - e.g. group chat messages or PubSub flat nodes - is left intact as the content is not only owned by one party. Logs are not a part of this action. If the log levels are set at least to ‘warning’, there is no personal data that can be tied to the JIDs in the first place so there is no need for removal.

Long term peace of mind

The recent privacy concerns around major messaging providers have created an unwanted hurdle for many businesses, but one that can be easily overcome. They do, however, serve as a good example of one of the broader problems with choosing an out-of-the-box, software-as-a-service provider for your chat solutions. Most third-party products are offered as one-size-fits-all, meaning any changes made by the owner will impact your account. This creates an uncontrolled liability for your business. MongooseIM offers an easily manageable alternative to software-as-a-service providers. With us, you’ll be able to utilise an extremely robust, battle-tested messaging server that has been used by many of the world’s most used chat applications. In doing so, you’ll have a solution that is scalable, reliable, customisable and owned by your business. Learn more on our MongooseIM product page.

Type systems and checking in Elixir and Ruby

I must admit – when it comes to programming languages, I have a type: robust type systems!

Ever since learning Elm, I’ve fallen in love with programming with an expressive type system. Since I also work in other languages which are dynamically typed, I find myself yearning for a more robust type system and the guarantees of a static type checker. So I started exploring options for adding types to some of the languages in which I work most frequently, Ruby and Elixir.

Robust type systems are also becoming increasingly popular – it’s not just me who loves them! Type systems and type checking eliminate a whole genre of errors – runtime errors – and they do more than that too: they offer a means of modeling your domain, enforcing contracts and consistency within code, and documenting your code to enhance readability and intent. There are lots of ways that types can benefit a project, and much of the tooling out there lets you gradually add types to your codebase, so you can test out how it’s benefitting you before making a commitment.

As more people have been seeing the value of working with an expressive type system, more tooling has been created for gradually implementing them in dynamically typed languages, and I want to share about what’s emerging in this area.

While I’ll be focusing on Ruby and Elixir, TypeScript very much deserves mention here as well, since it’s an option for typing with JavaScript. Because it’s already more widely used, I want to concentrate on what’s emerging in other languages, but its growing popularity is testament to the helpfulness of strong typing. Elm is another favorite language of mine that we’ve written about a bunch at thoughtbot, which is considered to have a very robust type system.

Note on vocabulary

Before we begin, let’s review some vocabulary.

“Statically typed” means that types are checked at compile time. “Dynamically typed” means that types are checked at runtime. We’ll use these terms to formally categorize languages.

A “type system” is a language’s set of rules that govern its constructs, such as variables, functions, and components like strings and integers, or other data structures like maps and lists.

“Strongly typed” or “strictly typed” means that a language’s type system is robust, expressive, and strictly enforced. A “weak” type system by contrast is one which has more permissive rules and which does not support features for expressing more complex types. “Strong” and “weak” are therefore not formal technical terms, and rather are used to compare different type systems based on how robust they are relative to each other. So we can think of type systems on a range from more primitive, or weak, to more expressive, or strong.

Elixir

Elixir offers some means of working with types out of the box, which is really cool. Using typespecs and tooling for static analysis, one can mimic many aspects of a static type system. There’s also an exciting new language in development, Gleam, which, not to be reductive, is like Elixir with type checking, and it is pretty awesome.

Typespecs

Elixir offers typespecs as an opt-in feature. Typespecs are used to annotate function signatures (also called specifications), and for defining custom types.

On any Elixir function, you can add a typespec as annotation to state the types of arguments it expects and the type which will be returned. Typespecs are like documentation and they are not evaluated at runtime, so if you make a mistake with them, that won’t affect how the code runs.

Typespecs can also be used to define custom types in Elixir, a feature of stronger type systems.

Typespecs alone are valuable, even if you don’t combine them with other tooling to enforce type checking. I like to think of a type annotation as documentation, and as a contract. When you add a specification, it’s like making a promise that your code does what you’re saying it does. It clarifies intent for other readers of the code. If you use typespecs to define custom types, you can use those types to expressively model your domain.
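
To make this concrete, here is a small sketch of a custom type and a function specification. The Temperature module and its reading type are invented for illustration and are not taken from any library.

```elixir
# Hypothetical example module for illustrating @type and @spec.
defmodule Temperature do
  @typedoc "A temperature reading tagged with its unit."
  @type reading :: {:celsius, number()} | {:fahrenheit, number()}

  # The spec documents the contract: a reading goes in, a number comes out.
  @spec to_celsius(reading()) :: number()
  def to_celsius({:celsius, degrees}), do: degrees
  def to_celsius({:fahrenheit, degrees}), do: (degrees - 32) * 5 / 9
end

IO.inspect(Temperature.to_celsius({:fahrenheit, 212}))

# Typespecs are not checked at runtime: this call breaks the contract,
# yet it runs without an error. Catching it is a job for external tooling.
IO.inspect(Temperature.to_celsius({:celsius, "warm"}))
```

The second call shows why a spec on its own is closer to documentation than to enforcement: the string passes straight through, and only a separate analysis step could flag the mismatch.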

If you’re interested in the benefits of a type checker, you have some options to layer onto typespecs.

You can enable an IDE to highlight any mismatches between your typespecs and code so you can identify issues as you work (I recommend Visual Studio Code + VSCode Elixir Plugin for this). This plugin will even show you auto-generated annotations to make adding them less work, and you’ll see highlights when you’ve created a mismatch, saving time as you develop.

Typespecs can also be combined with a tool such as Dialyzer to do type checking on your behalf outside of when you run the program, thus mimicking a static type checker.

See the documentation here to go deeper on how to use typespecs, such as mixing them in with guards or using opaque types.

Finally, it’s worth noting that Elixir offers some means of making “contracts” within code beyond type checking, like sum types or Behaviors. These are outside the scope of this article but I encourage you to check out these links if you’re interested in going deeper.

Gleam

The other option to consider for Elixir is Gleam, an exciting new language in development. Again, not to be reductive, but Gleam is like Elixir + robust static typing. I tried it out and it felt exactly like Elixir, except I could build custom types with ease and rely on the type checker. I did struggle a bit to get my Gleam environment set up because some of the requirements involved different versions of Erlang OTP and rebar than what I had installed already, but I was still able to get going. I have not tried adding Gleam to a Phoenix application but it seems it can be done! I hope to try that out soon.

I think that currently it’s more practical in Elixir to rely on the built in support for typing than it is to switch to Gleam, but I’m excited about where Gleam is going and think it could prove to be a really excellent language.

Ruby

In Ruby land, there’s a growing suite of tools for type checking around existing Ruby code - Sorbet, RBS, and, excitingly, the upcoming release of Ruby 3 which will ship with support for type checking.

Sorbet

Sorbet is a type checker designed for Ruby and built by Stripe as they needed a way to maintain and scale their very large code base. Sorbet lets you add type checking to your Ruby code one file at a time. I recently tried adding it to a project, and it was easy to set up initially.

One weird thing you run into with Sorbet is that metaprogramming is popular in Ruby and especially so in Rails, where many functions are defined at runtime, and therefore don’t yet exist when type checked ahead of runtime. For cases like this, you can declare types of these methods ahead of time and Sorbet will know to use them. It’s a little counterintuitive to add types for methods that don’t yet exist, but it works.

If you’re looking to add Sorbet to a Rails project, I’d recommend sorbet-rails which offers tooling to help auto-generate types based on patterns in Rails, like column getter methods. It would be a lot of redundant work to add type checking to everything in a Rails project, so this tool thankfully automates that away.

RBS and Ruby 3

RBS is a language for defining types that will be used in Ruby 3, which will ship with support for type annotations. You can also use it independently of Ruby 3. Sorbet will be incorporating RBS as a means of adding type specifications, and you can read about the ongoing development here. Because of this, Sorbet is concentrating on other features while the Ruby team focuses on the RBS side of things. For now, if you’re looking to get going with types in Ruby, I’d recommend Sorbet (and Sorbet Rails) because it has more tooling and documentation available, which will still be relevant when Ruby 3 ships.

Give it a try!

If you’re working in a dynamically typed language, consider whether adding a more robust type system and static type checking would benefit your code. For many popular language ecosystems, there is tooling emerging for this. It’s been my experience that adding a type system can help get rid of errors, aid in modeling the domain, and clarify the intent of your code for others to read, all of which help to scale and maintain big projects. Importantly, you don’t have to switch over to types all at once; you can opt to add them gradually. Consider trying it out!

Orchestrating computer vision with Elixir at V7

V7 is a web platform to create the sense of sight: a hub for machine learning and software engineers to develop their computer vision projects with data set management, image/video labeling, and one-click model training to automate any visual task.

Founded in 2018 by Alberto Rizzoli and Simon Edwardsson, V7 uses Elixir, Phoenix, and Cowboy to power their web platform, responsible for managing large amounts of data and orchestrating dozens of Python nodes to carry out machine learning jobs. They have recently closed a $3M seed round, and they are currently hiring backend engineers to augment their Elixir team.

Visual tasks

Throughout the years, we have been continuously automating visual tasks to speed up manual processes and reduce the rate of errors. For example:

  • Routine inspection of infrastructure: oil pipelines and offshore oil rigs require constant examination against corrosion. Once there is too much rust, it can damage the pipeline and cause leakage. Nowadays, you can use drones to take pictures and automate the detection of oxidized spots.

  • Medical examination: there is a growing use of digital pathology to assist doctors in diagnosing diseases. For example, during a biopsy of possible liver cancer, doctors use a microscope to visualize human tissue and stitch together an image of the cells, which are then individually analyzed. AI can double-check these images and help speed up the identification of problematic cells in case of positives.

  • Agriculture and farming: a wine producer may want to count grapes in a vineyard to estimate the wine production for a given season with higher precision. Farmers may use video to assess the health and the amount of exercise of free-range chickens and pigs.

  • Visual automation also plays a growing role in quality assurance and robotics: a fast-food manufacturer can use cameras to identify fries with black spots, while harvesters may use robots to pick apples from trees.

Neural networks are at the heart of these tasks, and there is a growing need to automate the creation of the networks themselves.

Automating AI

Training a neural network for image and video classification often requires multiple steps. First, you annotate images and frames with bounding boxes, polygons, skeletons, and many other formats. The annotations are then labeled and used to train computer vision models. Labeled annotations are also used to verify models against biases, outliers, and over/underfitting.

For many AI companies, this process exists in a loop as they continuously refine datasets and models. V7 helps teams manage and automate these steps, accelerating the creation of high-quality training data by 10-100x. Users may then export this data or use it to create neural networks directly via the platform.

V7 uses Elixir to orchestrate all of these tasks. The front-end is a Vue.js application that talks to a Phoenix-powered API. The Phoenix application has to work with a large amount of data across a wide variety of formats. For example, a microscope outputs images in a different format, often proprietary, than a regular laboratory camera.

To perform all the machine learning tasks, V7 has a cluster of Python nodes orchestrated by an Elixir application running the Cowboy webserver. Once a Python node comes up, it establishes a WebSocket connection with Cowboy and sends how much memory, CPU, GPU, and other relevant data it has available.

The Phoenix-powered backend communicates with the orchestrator using another Erlang VM-based technology: RabbitMQ. For example, when the user asks to auto-annotate an image, the Vue.js front-end sends a REST request to Phoenix. Phoenix then enqueues a message on RabbitMQ with the image’s location, typically an Amazon S3 bucket. The orchestrator picks this message up, finds an available Python node, and delivers the relevant instructions via WebSockets.

Ecosystem and Infrastructure

Other tools used by the V7 team are Broadway and the Erlang Distribution.

V7 has to process and normalize images and videos. For these, they have a separate service that receives RabbitMQ messages and invokes ImageMagick or FFmpeg accordingly. They use Broadway to receive RabbitMQ messages and to execute these tasks concurrently.

The Erlang Distribution helps them broadcast information across nodes. Since they store their multimedia data on S3, they need to generate pre-signed URLs whenever the user wants to see an image or video. However, if users are routed to a different node, they would get a different URL, which would force them to download the asset again. To address this, they use the Erlang Distribution to communicate which URLs they have generated and for which purposes.

Overall, their backend runs on Amazon ECS on about four nodes, which talk directly to PostgreSQL. The largest part of their infrastructure is the Python cluster, which takes up to two dozen machines.

Learning and Hiring

Elixir has been present inside the company since day one, back in August 2018. Andrea Azzini, the first engineer at V7, was the one responsible for introducing it. He believed the language would be a good fit for the challenges ahead of them based on his experience running Elixir in production.

Simon Edwardsson, their CTO, had to learn the language as they developed the system, but he was able to get up and running quickly, thanks to his previous experiences with Python and Haskell. He remarks: “As a team, we were more familiar with Django, but we were concerned it would not handle well the amount of data and annotations that we manage - which could lead to rewrites or frustrations down the road. From this perspective, the investment in Elixir was worth it, as we never had to do major changes on our backend since we started.”

Part of this is thanks to Phoenix’s ability to provide high-level abstractions while making its building blocks accessible to developers: “While there is magic happening inside Phoenix, it is straight-forward to peek under the hood and make sense of everything.”

V7 has recently welcomed a new Elixir engineer to their team, making it a total of four, and they are looking for more developers interested in joining them. Historically, more engineers have applied to their machine learning positions, but they also believe many Elixir developers are prepared but don’t consider themselves ready. Simon finishes with an invitation: “We are primarily looking for backend engineers with either existing Elixir experience or willingness to learn on the job. If you are interested in automating computer vision across a large range of industries, we welcome you to get in touch.”

Home Alone: a Post-Incident Review

2020/12/27

Over the last few years I've been reading more and more about resilience engineering and all the related material when it comes to Learning From Incidents. This stuff has a way of sticking with you and changing how you think, and I found myself adjusting my perspective on a ton of things.

As this is the holidays, Home Alone is on TV a lot, and it's part of our household's tradition to watch the first two. Over the years, I've grown to know the movie pretty well—both the original English and the French dub—and generally know what to expect. But this year, I ended up looking at it with a different eye. My perspective had shifted from my usual watch where things go absurdly bad and a bunch of hurried careless people leave Kevin behind, and he's stuck at home to fight burglars.

This time around I noticed all the little checks that were in place but nevertheless failed. It turns out the movie is surprisingly detailed on all these things and made a lot more attempts than I initially thought to make it almost inevitable that Kevin would be stuck home alone. So the thing I decided to do was write an incident investigation the way I would do them for work issues. I can't interview people, but I have a fully recorded movie to work with, along with a script draft found online, which sometimes adds to or conflicts with what the movie shows.

Approach

I wanted to make a bit of a comparison with the more usual "root cause analysis" as we see it in tech, often based on the five whys, which consists of asking why on each problem until we narrow down to the actual root cause. The issues with this approach are manifold:

  1. it is arbitrary in how many times you must ask "why?" We stop whenever we actually feel comfortable with the fault we’ve found, not necessarily because it's good.
  2. it is reductive in the pathways it makes you follow because it tends to follow direct causal paths and miss important but indirect contextual cues.
  3. it is done backwards from the known consequences of the incident, and is inherently going to be tinted by hindsight and outcome bias, making faults look obvious in retrospect when they were not at the time.

The approach I wanted to take instead is one where we focus on the messy details, to understand the challenges people were facing. One way to accomplish this is to do a deep dive into the events as they unfolded, and to do it from a perspective of local bounded rationality with the assumption that people are trying to do a good job. By reconstructing events as they happened in time, we can hope to partially reduce hindsight bias and lower the influence of outcome bias. If we want to have real impact, decisions have to be framed in terms of the intents and information people had at the time, rather than with full knowledge of the consequences.

Finally, this is not a serious study of anything. I did this because I found the idea amusing, but I did not put that much care into it and didn't apply a ton of professionalism.


The Report

This report looks at events that transpired in December 1990, when a family living in Chicago and on their way to France for Christmas ended up leaving their eight-year-old son home alone. The case has gotten worldwide recognition due to the inventive ways in which the aforementioned youngest child managed to foil the plans of two burglars through creative and rather violent use of home supplies.

This text mainly focuses on the circumstances surrounding the incident. The world at large was surprised and outraged, generally calling the family irresponsible and wondering why the police, who had been involved while this was taking place, apparently did nothing. We will cover a brief timeline of the events leading up to the youngest child being left alone, and what happened to reunite him with the rest of the family. We are not concerned with the improvised home defense systems, and aim to shed enough light onto the incident to allow a reasonable chance at both prevention and timely recovery, rather than advocating for ways of booby-trapping a house, which we believe would be impractical to render safe. A brief analysis with recommendations will follow.

Timeline

In late 1990, three brothers decided to reunite their respective families for the Holidays via a trip to Paris. The trip was being organized by the one brother living in Paris (also known to have owned real-estate in New York City), who was paying to fly in the families of his two other brothers. The events in question take place at the house of one of these two brothers, in Chicago, on December 22. Since they were leaving from O'Hare on the 23rd, both brothers' families were reunited there that night.

Altogether, the house contained 15 people: the parents who owned the house and their 5 children (including the one who was later forgotten), the other brother with his wife and 5 children (one of whom was the youngest of the bunch), and finally one of the daughters of the Parisian uncle. Things were hectic around the crowded house: sleeping arrangements had been reviewed and modified, and everyone scrambled to finish packing their suitcases before the end of the night.

The eldest son of the hosting family had ordered ten pizzas while adults were looking for voltage adapters before their departure on the following morning. It is reported that the youngest son, during that time, was feeling irritated by most of the remaining occupants, who were said to be dismissive of him—if not downright antagonistic—in all their hurry. These frequent conflicts had been a recurring theme in the household, and would still be a factor for this incident.

Everyone settles down for dinner: drinks are self-serve and all food is served with single-use cups and plates. Conflict erupts rapidly as the youngest of the host family sees his roommate (the youngest boy from the visiting family) drink Pepsi, which makes him wet the bed, while others eat all of the only pizza he wanted. In a fit of anger, he charges at his older brother, knocking him back and spilling milk all over the kitchen counter, the passports, and the plane tickets.

Home alone still: milk spilled over passports and tickets

The commotion propagates to the main table as the father gets up to salvage them, spills soda on the table, and the youngest kid gets wedged between his own father's chair and the wall at the dining table. While trying to dry the passports and tickets, the hosting father accidentally throws his son's ticket into the trash, before covering it with more similarly-colored napkins.

Home alone altered still: a ticket in the trash can, highlighted and with the name blurred

For pushing his older brother and being seen as the source of the conflict, the eight-year-old attracts the ire of his entire extended family. His mother grounds him by sending him to bed directly, while also moving him from his bedroom to the attic: "fifteen people in this house and you're the only one acting up." This number fifteen will be critical in the next few hours, as we'll soon see.

One small victory for the child is that he will at the very least sleep alone, without his bed-wetting cousin.

During the night, strong winds blow over Chicago. At 4:37AM (10:37 UTC), the wind breaks a tree branch, which falls on an electric line and cuts all power to the house. This ends up resetting the hosts' alarm clock, a Panasonic RC-6067, which supports battery backup power. Unfortunately, this backup power source fails, and the family misses their wake-up call.

Home alone altered still: the parents' alarm, highlighting the model and battery backup

At 8:00AM (14:00 UTC) sharp, two airport shuttles show up at the house, the drivers confused by the lack of visible activity indoors. Their knocking on the door manages to awaken the host family's mother, who in turn alerts everyone about the situation. While everyone gets dressed and the household turns frantic, their across-the-street neighbour's kid (of the same age and height as the host family's youngest) invites himself into the front yard and starts talking with the shuttle drivers, who are loading the luggage that was all ready.

Inside the house, the mother starts delegating work. She asks the oldest daughter of the French uncle, an adult attending Northwestern University (but who we suspect still sits at the kids' table and shall do so until she's 37), to do a headcount of everyone entering the airport shuttles. The father of the household is asked to retrieve the passports, which he had placed in the microwave to dry.

Kids start pouring out of the front door while their neighbour is already in a shuttle, asking tons of questions to an exasperated driver. The headcount is done by having the kids line up in front of one of the vans, the same one the neighbour is in. While the eldest runs the headcount, one of the kids randomly mentions numbers in an attempt to confuse her. She ends up skipping one of the girls and counting herself twice due to the interruption, and believes the neighbour (who is facing backwards and rummaging through someone's luggage) is one of the kids from the current household. The end result is that her headcount lines up properly: 11 kids as expected.

Home alone altered still: the headcount scene

Headcount sequence, with the neighbours' kid highlighted. Note that the headcounter arrives on the scene from the left, with the imposter already in place to the right.

She divides everyone into the two shuttle vans, but by the time the neighbour's kid turns around to reveal himself, she's already on her way to the other van and does not notice that she counted him as one of their own. The four adults exit the house and lock up behind them. One of them mentions that they only have 45 minutes left to catch their flight. Before the mother enters the shuttle, a working lineman announces that power is back on, but that the phone lines will remain cut for a few days.

In the shuttle, the mother asks the young adult to report on the headcount. She reports: "Eleven including me. Five boys, six girls, four parents, and a partridge in a pear tree." It's unclear whether she actually took the time to note the genders of everyone she counted, or if these were attributed post-hoc by believing the count to be exact.

The adults are divided up, one from each couple in each shuttle. It is important to point out that the act of dividing groups like that gives an immediate rational explanation for not being able to see any of the kids in the local shuttle: the count is correct, therefore the kid has to be in the other van. This can be believed regardless of whether the kid is there or stuck at home.

In the next scene we see the entire family running through the airport. The adults are constantly looking back; some hold the hands of their younger children while one of them literally carries the youngest one of the bunch in his arms. They make it to their gate at the last minute, where the father hands all the tickets (minus one, which was destroyed) to the airline employee.

She mentions that the remaining seats in coach are singles. Seats are unassigned there, and since the parents are flying in first class, they'll be set aside. The employee counts people rushing in, making sure that all tickets and people match up in number only. Here we have a discrepancy between the movie and the late script draft that was available; in the latter, no ticket is ever mentioned as being destroyed, but instead, the airline employee is reported to get an exact count (11 in coach, 4 in first class). In either source, everything looks normal to everyone involved, and there is no signal or alarm to make anyone suspect that anybody is missing.

Home alone still: the airport's quick headcount scene

Airport employee counting people as they enter the boarding tunnel, making sure their number matches tickets.

Everyone instead appears to be relieved that they managed to make it at the last moment, while the stewards rush everyone into their seats to leave on time. One of the parents mentions hoping they forgot nothing. Of course, by now everyone in the audience is aware of the irony that the youngest of the household has been left behind, unaware of all the commotion, having been isolated in the attic, alone and fast asleep.

It takes a few hours before the mother suspects something is wrong. She has the gut feeling that they forgot something, and discusses it with her husband. They go over a mental checklist of items they had to cover and might have missed: coffee machine, locking up, closing the garage. Indeed, the husband had forgotten to close the garage and he figures that's what was missing and making his wife uneasy. She settles back down, only for the feeling to linger. She thinks for a few seconds more before realizing what they forgot is their son.

The adults are dismayed on the flight, blaming themselves and feeling horrible. They have tried reaching home through the captain's phone, but since the phone line at their house is out of order, they're stuck waiting. Some of them try to minimize the guilt: "We'll call when we land", "we didn't forget him, we just miscounted", but the mother is still distraught and showing remorse.

As soon as the plane lands, the family rushes towards public phones. The mother gets her phone book out (a 2-inch thick document) and starts dispatching everyone to call contacts on their street. She personally calls the local police department to ask them to reach out to her son. They instead redirect her to the Family Crisis Intervention center, where she asks for a check-up once again, only to be redirected to the police. While she handles the bureaucracy, the rest of the family reports not being able to reach anyone. Apparently the whole neighborhood (at least those whose phone numbers they have) is out for the Holidays.

Finally, the police department sends an officer to the house, who knocks on the door. Since the child is afraid of one of their neighbours he had just encountered, he refuses to answer. The police officer, seeing no one and no damage, declares the house safe and asks the parents to just count the kids again.

Ultimately, the family had to focus on getting back home as soon as possible to do things by themselves. No flights are available, not even private ones, until Friday morning, two days away. They end up dividing up into two swimlanes to help resolve things: the mother will remain at the airport as a standby to grab the first flight to America she can get, while the rest of the family will go to the home they were expected at in Paris to try the phones some more, at least until Friday when they can then fly back.

Once at the Parisian home, the father has trouble reaching anyone actually speaking English over the phone, and when he does they're all away for shopping or just not home. Some of the children feel bad about it, while others (mainly those who had ongoing quarrels with the one left behind) see it as some sort of well-deserved retribution with limited concerns otherwise ("we have smoke detectors [and] live in the most boring street in the United States where nothing dangerous will ever happen").

This appears to be the end of efforts as we know them from the family's side. The mother, still stuck in Paris, is haggling to see if she could trade or buy tickets from other travellers to make it home faster. She finally manages to trade a place for jewelry, cash, and first class tickets later in the week, and can board towards America.

Interestingly, by the time the adults have given up on reaching neighbours, it appears that phone communication has been re-established at the house, since the child stuck home alone manages to order pizza. It is unclear why the rest of the family in Paris appears not to have tried again (or, if they did, whether they missed the child because he was out running errands), but the phone isn't of consequence for the rest of events. There are various possible explanations, ranging from the predicted time given by the worker at their door to having tried other ways to do things, but we have no information regarding what they are.

By December 24th, the mother had reached the United States via Dallas and was by then stranded in Scranton, unable to make it any closer to Chicago due to everything being booked, once again. By then she's been awake for about 60 hours. She meets a man in a polka band who offers to drive her home in the back of a Budget rental moving truck while on their way to Milwaukee, which she agrees to more out of desperation (she mentions selling her soul if she must) than love of polka.

She makes it home a bit more than half a day later, on Christmas morning. The rest of the family arrives home from their Friday morning flight a few minutes later, to the surprise of the two already there. The child makes peace with everyone, and describes his period home alone as thankfully pretty uneventful.

Analysis

This story shows one of the typical "perfect storms" that are characteristic of surprising incidents:

  • A ticket was accidentally thrown out while milk was spilled on both tickets and passports (prevents noticing at recount or gates)
  • Unexpected undoing of pairs/buddy system following dinner fight, with the kid being sent to the attic bedroom, which isolated him from the noise of everyone getting up
  • Loss of power, with a failing battery backup on the alarm clock
  • Heightened pressure due to being late with a ton of children to handle
  • Two shuttle vans with unassigned seats making it easy for unaccounted people to be elsewhere without causing concern
  • Presence of the neighbours' kid with a height and age similar to one of the children who were to be counted
  • One of the kids disrupting the van headcount of children on purpose
  • Lax security theatre in pre-9/11 days removes opportunities to match tickets to IDs, although the count of people accepted in still worked
  • Being this late likely meant nobody could check in their luggage on time (or so we assume), and everyone likely had to carry it on, removing another potential check from taking place.
  • Parents ended up with seats in first class but children in coach, with all seats unassigned, making it difficult to spot anyone missing.

Of particular interest here is how many planned fail-safes and checks were in place. Take any of these elements out and there's a good chance that nobody would have been left behind. All the tickets being intact would have prevented bad counts at the airport, a non-failing alarm would have removed hurry and pressure, the neighbour's kid not being around would have made the van headcount effective, and so on. Every item in the list needed to happen for one child to be forgotten.

We have to consider the context in place. If nobody thinks leaving a child behind is an actual possibility, their focus is going to be on making sure they just make it on time for their trip and preparation. The various checks and headcounts would usually work, and just seeing the children would have one assume that they would notice any problems that would come up. This self-assurance, the divided vans, and absolute rush to make their flight had everyone focused on something else when they believed the headcounts to be good and complete. The race against the clock at the airport and unassigned seats in the plane did not help either.

The time pressure usually brings in compromises: not all work can be done as diligently since doing that would take too long. Hence, there has to be a tradeoff to try and get a reasonable output within the constraints. To minimize losses, they delegate tasks such as the headcount to others they believe trustworthy. An important step done is to still check back with each other to validate that everything has gone according to plan. Rather than removing the failsafes or dropping tasks, they reorganized the work to make sure everything still happens to the best they can. Obviously these pressures reduce the amount of attention that can be given to each task and a few were forgotten, such as closing the garage doors, to name one.

This is important to point out, because these compromises are made out of good intentions in an attempt to make things work. This means that any fix that focuses on adding time-consuming safety checks risks, in practice, being skipped by people in a hurry.

One thing that happens later, on the plane, is that feeling that something is wrong, not properly in place. This sort of intuition is typical of people slowly but unconsciously reevaluating their model of the world and understanding based on observations. Something clashes, but the amount of evidence has to add up until it tips the scales of what is expected and currently understood. The child's parents are aware of the rush and that they might have forgotten something, and they both recount things they've been doing over the morning to find what might be wrong. All their guesses are more mundane (and maybe considered more likely), and since none of them add up, the mother ends up realizing the problem: one child is missing.

They now enter recovery mode, where the intent is to fix the problem they have just identified. The first reflex is to try and reach out by phone, which they will keep on trying to do for hours. During down time, they blame themselves a lot. Furthermore, prompt recovery was rendered difficult:

  • Phone lines were cut for an extended period of time
  • The family's attempts with phones are slowed down by not speaking French
  • Neighbours were unreachable for help
  • The police department did minimal checks but did not trust the family's testimonies
  • Flights were overbooked and no seats were available

These events unfold in 1990, when for the most part, if you don't know someone's phone number, you don't have an easy way to reach them quickly. What we see here is the family trying the whole playbook, and opening up new swimlanes to get back to Chicago ASAP: the mother staying on standby in hopes of getting home faster, and the rest of the family waiting on a certain result by Friday morning. To go faster, she trades her jewelry and tries to negotiate with other passengers, and ends up riding hours in the back of a truck with strangers. Still, they end up home at the same time, to a child who was fortunately safe and healthy.

Recommendations

Although the main focus of this report is to shed light on how events unfolded, we nevertheless include recommendations to prevent similar incidents from happening in other situations. We divide up the suggestions into two categories: preventative means to avoid being in a rush, and organisational approaches to cope with an emergency rush better.

As Prevention

We've identified the following adjustments that could be done at a minimal cognitive cost and may make a difference:

  • Change the batteries in the alarm clock at the same time the family changes the batteries in their smoke detectors. Rather than asking to do this right before travel time (along with a ton of small items that compete for attention), blending battery-changing tasks with other ones may end up creating an overall easier way to make sure they're always charged. Additionally, testing that the alarm clock's battery failover works at the same time would also ensure it is functional at a non-critical time and give ample time to fix things.
  • Find a relative or friend who is trusted and can either house-sit or just make periodic rounds around the home; this is a general recommendation by insurance companies to prevent or detect water damage early. In this case, this practice would also have given the family a point of contact who was expected to be present (and not travelling out) and could have helped come to a faster resolution.
  • If possible, use assigned seating (either in shuttles or airplanes) with everyone aware of their neighbour. This would create a de-facto buddy system and be more reliable than a headcount since it would also act as an identity check.
  • Prepare written checklists of items to do and validate during the trip. One of the challenges of that day was in juggling all the tasks in their minds. Having the lists be prioritized may also help naturally cover important tasks while making it easier to shed the less important ones.

It's interesting to point out that the checklist wouldn't have solved this incident; the headcounts were successfully (albeit incorrectly) completed. Similarly, the other preventative measures only lower the chance of such incidents happening but do not eliminate it: a family could be staying at a different place where their clock isn't available, a last-minute airplane model change could throw off buddy systems, and a family member might get lost somewhere other than their home (e.g. at the airport or at the destination).

In Action

We have special interest in providing measures that could be used in a rush. It is important to point out that any method that is seen as a "redundant" check that takes more time is likely to be ignored by people in a rush: they did a check, they're short on time, and things are gonna be good enough. As such, we avoid suggesting anything which is equivalent to "take more time and think harder", since such advice is not useful in high-pressure/high-stakes circumstances and tends to produce policies geared for work-as-imagined rather than work-as-done.

There is in fact a single recommendation the author could come up with that might help prevent another home alone situation: redundant awareness and responsibility. We propose what is in effect a "pod" system: each adult is assigned two to three children (ideally theirs since they already have an inclination to care for them), and each child in the pod is made aware of who else is with them. People in a given pod should put their luggage together, have their tickets handled by their responsible adult, and sit together as much as possible.

Having this shared understanding where people in a group have to cover each other lowers the amount of things they have to keep in check within a large group, minimizes the chance of adversarial relationships between relatives interfering with the process, and lowers chances of plausible deniability around seating arrangements. It would also, we hope, increase chances of successful early detection of anything going wrong, from missing luggage to lost children at any step of the way, whether at home or the airport.
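To make the difference between a headcount and a pod roll call concrete, here is a small Python sketch. The labels (`kid-1` to `kid-11`, `adult-1` to `adult-4`) and the pod assignments are placeholders invented for the illustration; only the shape of the counting error (one child absent, one girl skipped, the counter tallied twice, the neighbour included) follows the timeline above.

```python
def headcount_ok(tallied, expected_total):
    """A headcount only compares totals; it never checks who was counted."""
    return len(tallied) == expected_total


def pod_roll_call(pods, present):
    """Return, per responsible adult, the pod members who are not present."""
    missing = {}
    for adult, members in pods.items():
        absent = members - present
        if absent:
            missing[adult] = absent
    return missing


# Eleven children are expected; kid-1 is the counter and kid-11 is left in the attic.
present = {f"kid-{n}" for n in range(1, 11)} | {"neighbour"}

# The tally skips kid-10, counts kid-1 twice, and includes the neighbour.
tallied = [f"kid-{n}" for n in range(1, 10)] + ["kid-1", "neighbour"]

# Hypothetical pods: each adult is responsible for two or three named children.
pods = {
    "adult-1": {"kid-1", "kid-2", "kid-3"},
    "adult-2": {"kid-4", "kid-5", "kid-6"},
    "adult-3": {"kid-7", "kid-8", "kid-9"},
    "adult-4": {"kid-10", "kid-11"},
}

print(headcount_ok(tallied, 11))      # True: the total matches despite three errors
print(pod_roll_call(pods, present))   # {'adult-4': {'kid-11'}}
```

The headcount passes because the errors cancel out, while the roll call is an identity check: extra people cannot mask a missing one.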


Comparison

So that's it for my post-incident review. I believe that focusing on how things were done and how they unfolded has highlighted that the vast array of checks put in place likely already goes beyond what most people would do.

The preventive fixes I suggested are contextual and could all also fail individually, as pointed out in the report, but I believe that they are far less brittle than just suggesting things like "change the battery before leaving", "don't leave a milk container open", or "don't leave tickets in the kitchen area" which would come as rather easy potential suggestions from a five whys approach. There's also an easy suggestion to make in trying to reduce the amount of conflicts within siblings and cousins of the family, but I do not feel it is very realistic to provide a fix for that (spoiler alert, most of these easy checks that are done through looking backwards at the incidents and finding faults tend to lead to these easy but brittle fixes that wouldn't have worked in a second movie).

There are much simpler scenarios where things could have failed, such as a kid making it to a correct headcount, then having to go to the bathroom before leaving, and doing so with no adults seeing them. This, once again, could have happened either at home, at the airport, or anywhere in-between. Emergency and rush situations can be created easily as well: one of the shuttles could have gotten a flat tire, a ticket could have been pickpocketed or accidentally dropped while accessing other papers, someone could have injured themselves on ice, the fallen branch that destroyed the power line could have laid across the street and slowed things down, connecting flights could have been delayed, someone could lose a bag, and so on.

The movie demonstrated that the family, in general, was quite adept at verifying things, delegating authority, keeping in sync, and finding multiple solutions once a problem was identified. They were ready to a greater extent than a lot of people I know (props to the mom for carrying this large of an address book to go meet family), but knowledge of the outcome (they forget a kid) tends to define judgment (they have to have been irresponsible!).

I believe the most effective option is the pod system, which is a strategy that would highlight breakdowns faster and allow prompt action at limited cognitive cost with minimal planning. Having multiple people involved in each group with redundant responsibilities increases the likelihood that any one of them notices what might be wrong, especially if the adults are overwhelmed dealing with things only they can do. Specifically, this approach shows promise regardless of what happens to an individual, not just being forgotten at home.

Finally I would like to compliment Kevin on his solid infosec skills in never mentioning he was left home alone, despite that potentially speeding up someone helping him given the circumstances. I am also a bit disconcerted how not one but two dads could potentially make a plan by which they were not all at the airport at least 5-8 hours ahead of their flights, which would have had everyone wake up before 4am and have avoided the power outage, but this is meant to be a blameless report after all.

Erlang: Writing a Tetris clone Part 3 – Gameplay rules, final features and deployment

The third video in this series moves on with implementation of gameplay rules, scoring, the “next” preview window and packaging and deploying the game using ZX as a shortcut.

Writing this is a lot of fun. As of this video the game part is finished, but there are a few things such as high score recording and maybe some network features that have not been implemented. I might make a fourth video that covers these but it might be more interesting to move on with another example to demonstrate techniques to accomplish similar tasks.

Quite a few details that were noticed and mentioned in the first two videos have been updated in the course of completing the code for this video, so it may be interesting to check the commit history if you’ve been following along.

As always, have fun making stuff!

An introduction to RabbitMQ - What is RabbitMQ?

Why Rabbit? What is MQ? How can it improve our applications? Why would I want to learn more about it? — These are questions that I asked when I was first introduced to RabbitMQ. I’m Gabor, and I’m now a RabbitMQ engineer and consultant. In my time working with RabbitMQ, I’ve learned that even experienced customers ask these questions.

TL;DR > Check out our RabbitMQ product page for use cases, features and to contact our expert consultants.

What problem does RabbitMQ solve?

Before we delve into what RabbitMQ is and how to use it, it is worth learning more about the problem domain itself. Communication between different services (a.k.a. computers) is an age-old problem.

On one hand, there are the different protocols defining the means of transportation and the properties of the communication. Some examples of such protocols include SMTP, FTP, HTTP or WebSockets (to name a few), which are all based on TCP/UDP. They deal with the formatting, reliability and finding the correct recipient of a message.

On the other hand, we can explore the communication from the perspective of the message. It exists in one system, then it is transported to another, gets transformed, that is, it has a lifecycle. As it travels from one system to another, we should be aware of where the message is, and who owns it at any given point of time.

The communication protocols mentioned above can make sure that the ownership (and the “physical” location) of the message is transferred from one system to the other (although it may take some time to execute this transaction). We can consider the transfer to be a transaction between the two parties while both are present. Most of the time, this active exchange is desirable, e.g. asking for the status of the service and expecting a timely and accurate answer. An example from the physical world would be calling somebody over the phone:
1) we start the call,
2) wait for the other party to answer,
3) have a nice discussion,
4) hang up the phone.

But there are other times when we don’t need the answer, we just need the receiver to take ownership of the message and do its job. In this case, we need an intermediary agent, another system to take ownership of the message (temporarily) and make sure that the message reaches its destination. To push the phone example further, the other party is not available at the moment, so we leave a voice message. The voicemail service will notify the intended receiver.

This asynchronous (delayed) message delivery is what RabbitMQ provides. Obviously, it can do more than a simple answering machine, so let’s explore some of the options it provides below.

(If you are interested in learning more about the history of RabbitMQ, I recommend the first chapter of “RabbitMQ in Action” by Alvaro Videla and Jason Williams. It will reveal the answer to why it is named after Rabbits).

What is RabbitMQ?

RabbitMQ is a free, open-source and extensible message queuing solution. It is a message broker that understands AMQP (Advanced Message Queuing Protocol), but can also be used with other popular messaging protocols such as MQTT. It is highly available, fault tolerant and scalable. It is implemented in Erlang/OTP, a technology tailored for building stable, reliable, fault-tolerant and highly scalable systems with native support for handling very large numbers of concurrent operations, as is the case with RabbitMQ and other systems such as WhatsApp and MongooseIM, to mention a few.

At a very high level, it is a middleware layer that enables different services in your application to communicate with each other without worrying about message loss, while meeting different quality of service (QoS) requirements. It also provides fine-grained and efficient message routing, which allows extensive decoupling of applications.

Use cases

To show off the versatility of RabbitMQ, we are going to use three case studies that demonstrate how RabbitMQ is well suited as a black-box managed service approach, as one that integrates tightly with the application enabling a well-functioning micro-service architecture, or as a gateway to other legacy projects.

RabbitMQ as a general message-bus

When a monolithic system is broken down into separate subsystems, one of the biggest problems to solve is which communication technology to use. A solution like MuleSoft or MassTransit can “wire” services together by declaring HTTP listeners and senders. This kind of solution treats RabbitMQ as a black box, but is still able to leverage its capabilities. As an example of direct communication, let’s use HTTP to “connect” the individual services. While it is a well-supported and solid choice, it has some drawbacks:
1) The discovery of services is not solved. A possible solution is to use DNS. As the system scales and grows, so too does the complexity of finding services and balancing the load between them. RabbitMQ can mitigate this increased complexity.
2) The communication is ephemeral. Messages are prone to being dropped or duplicated on the network layer. If a service is unavailable temporarily, the delivery fails.

RabbitMQ can help in both cases by utilising message queues as a means of transport. Services can publish and consume messages, which decouples the end-to-end message delivery from the availability of the destination service. If a consuming service is temporarily unavailable, the message is not lost as it would be with plain HTTP; it is safely buffered and retained in RabbitMQ, and eventually delivered when the service comes back online.
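
The buffering idea can be pictured with a toy model. The sketch below is not RabbitMQ code and makes no claim about how the broker works internally; it is a small in-memory stand-in written in Erlang, and the module and function names (mq_sketch, publish/2, deliver/2, pending/1) are made up for illustration.

```erlang
-module(mq_sketch).
-export([new/0, publish/2, deliver/2, pending/1]).

%% A toy stand-in for a single queue: messages wait here until a
%% consumer is available. This only models the idea of buffering.

new() ->
    queue:new().

%% Publishing always succeeds, whether or not a consumer is online.
publish(Msg, Q) ->
    queue:in(Msg, Q).

%% Hand over the oldest buffered message if the consumer is online,
%% otherwise leave everything buffered.
deliver(offline, Q) ->
    {nothing, Q};
deliver(online, Q) ->
    case queue:out(Q) of
        {{value, Msg}, Rest} -> {{message, Msg}, Rest};
        {empty, Rest}        -> {nothing, Rest}
    end.

%% Number of messages still waiting.
pending(Q) ->
    queue:len(Q).
```

Publishing two messages while the consumer is offline leaves both pending; once the consumer is online again they come out oldest first.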

Discoverability is simplified too. All we need to know is where RabbitMQ is and what the queue name is. Although it seems like this just reinvents the problem, this is scalable. The queue name acts as the address of the service. Having the individual services consume messages from the queues offers a means of scalability, i.e. each queue can serve multiple consumers and balance the load between them. There is no need to change the queue configuration already built into the services.

This moderately static queue configuration pushes RabbitMQ to a middleware layer where a solid design can guarantee a stable service quality in the long term.

RabbitMQ as an advanced routing layer for micro-services

On the other end of the spectrum is an architecture which is more fluid and adapts to the ever-changing needs of many micro-services. What makes RabbitMQ shine in this environment is the very powerful routing capabilities it provides.

The routing logic is implemented in different (so-called) exchange types that can be dynamically created by the application when needed. The destination services create the queues that they wish to consume from, then bind them to exchanges by specifying a pattern for the keys the publishers can use when publishing the message. (Think about these keys as metadata that the exchanges can use to route and deliver the messages to one or more queues.)

RabbitMQ comes with four useful exchange types that cover most of the use-cases for messaging:
1) Direct exchange. This delivers the incoming message to any queue whose binding key exactly matches the routing key of the message. If you bind each queue using its queue name as the binding key, then you can think of it as one-to-one message delivery. Delivering the same message to multiple queues is simple too: bind all of them with the same binding key.
2) Topic exchange. This delivers the incoming message to any queue whose wild-card binding key matches the routing key of the published message. Binding keys can contain wild-card matching criteria for a compound routing key (e.g. the binding key logs.*.error will match the routing keys logs.accounting.error and logs.ui.error). This enables us to write simple services where the logic is well contained, and the message will arrive at the correct services through the “magic” of RabbitMQ.
3) Fanout exchange. Some messages need to be delivered to all queues; this is where a fanout exchange can be used instead of writing elaborate multicast logic in the application. With a RabbitMQ fanout exchange, each service binds the appropriate queue to the exchange without needing to specify a binding key, and it all happens automatically. If a binding key is specified, the fanout exchange will simply ignore it and still broadcast messages to all queues bound to it.
4) Headers exchange. This exchange leverages the structure of AMQP messages and is capable of complex routing based on the headers (including custom ones) of the AMQP message. Headers are metadata attached to each message sent via AMQP.
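
To make the topic wild-card rule concrete, here is a minimal sketch of the matching logic in Erlang. It is an illustration only, not the broker's implementation, and the module name topic_sketch is made up. Besides the "*" wild-card from the example above (exactly one word), it also handles "#", the other wild-card topic exchanges support, which stands for zero or more words.

```erlang
-module(topic_sketch).
-export([matches/2]).

%% matches(BindingKey, RoutingKey) -> boolean()
%% Both keys are binaries made of dot-separated words.
%% "*" in the binding key stands for exactly one word,
%% "#" stands for zero or more words.
matches(BindingKey, RoutingKey) ->
    match(split(BindingKey), split(RoutingKey)).

split(Key) ->
    binary:split(Key, <<".">>, [global]).

match([], []) ->
    true;
match([<<"#">> | Ps], Ws) ->
    %% "#" either matches nothing, or swallows one word and stays active.
    match(Ps, Ws) orelse (Ws =/= [] andalso match([<<"#">> | Ps], tl(Ws)));
match([<<"*">> | Ps], [_ | Ws]) ->
    match(Ps, Ws);
match([W | Ps], [W | Ws]) ->
    match(Ps, Ws);
match(_, _) ->
    false.
```

With this sketch, topic_sketch:matches(<<"logs.*.error">>, <<"logs.ui.error">>) holds, while a routing key with an extra word such as logs.ui.db.error does not match that binding key.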

In addition to exchanges, there are other useful features in RabbitMQ which enable the implementation of very complex messaging logic. Some of the most important features include:
1) Custom plug-ins. RabbitMQ is extensible by allowing its users to add plug-ins. Almost every aspect of RabbitMQ is customisable, including the management, authentication and authorisation, back-up solutions, and clustering.
2) Clustering. When a single RabbitMQ server is not enough, multiple RabbitMQ brokers can be connected to work together and scale the system. It can enable RabbitMQ to process more messages or increase resilience to errors.
3) Quality of Service tuning. Time-sensitive message delivery can be helped by attaching a TTL (Time-to-Live) value to either the message or the queue. Timed-out messages can be automatically delivered to a dead-letter queue. Combining ordinary routing logic with these extra features can lead to highly advanced routing logic. Another useful feature is priority queues, where the publisher can assign a priority level to each message. It is also possible to limit the number of unacknowledged messages, which allows for performance tuning of the consuming services; in this case, RabbitMQ applies a back-pressure mechanism.
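
Queue-level TTL, dead-lettering and priorities are typically requested through optional arguments when a queue is declared. The sketch below only builds such an argument list, assuming the {Name, Type, Value} table shape used by the RabbitMQ Erlang client; the module name queue_args_sketch and the values a caller passes in are hypothetical. Declaring the queue itself and limiting unacknowledged messages both need a running broker and are left out here.

```erlang
-module(queue_args_sketch).
-export([args/3]).

%% Build optional queue arguments as {Name, Type, Value} triples.
%% TtlMs is the message time-to-live in milliseconds,
%% DeadLetterExchange is the exchange expired messages are sent to,
%% MaxPriority is the highest priority level the queue supports.
args(TtlMs, DeadLetterExchange, MaxPriority)
  when is_integer(TtlMs), TtlMs >= 0,
       is_binary(DeadLetterExchange),
       is_integer(MaxPriority), MaxPriority >= 1, MaxPriority =< 255 ->
    [{<<"x-message-ttl">>, long, TtlMs},
     {<<"x-dead-letter-exchange">>, longstr, DeadLetterExchange},
     {<<"x-max-priority">>, long, MaxPriority}].
```

For example, queue_args_sketch:args(60000, <<"dead-letters">>, 10) describes a queue whose messages expire after one minute and are then routed to an exchange named dead-letters (a made-up name).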

RabbitMQ integrated into legacy systems

In the previous use-case, I mentioned the possibility of using plug-ins to extend the functionality of RabbitMQ. This powerful feature allows RabbitMQ to act as a mediation layer between your RabbitMQ native (AMQP capable) services and other legacy applications. Some notable examples include:
1) Using RabbitMQ as an MQTT broker by simply enabling a plug-in. This opens up the landscape to many IoT technologies.
2) RabbitMQ’s JMS (Java Message Service) plug-in, which allows RabbitMQ to communicate with any JMS capable messaging solution.
3) If your application is using a proprietary protocol for communicating, it is possible to develop a custom plug-in to connect to any such service.

Conclusion

As the above examples demonstrate, there is hardly anything that RabbitMQ can’t communicate with. But as with anything in life, it has a price. Although configuring RabbitMQ is mostly straightforward, sometimes the sheer number of features can be overwhelming. If you face any problems with designing, implementing or supporting your RabbitMQ brokers, reach out to our expert team here. Or, if you’d like to kick-start your career in one of the most in-demand technologies, sign up for our 3-day RabbitMQ training course.

Erlang: Writing a Tetris clone Part 2 – Gameplay mechanics

Last night I was able to make the second video in my series about implementing a Tetris clone in Erlang. Yay!

In this video I start where I left off in the first video where I had ended with a data abstraction to represent the play field (called the “well” in Tetris lingo), a data abstraction for the game pieces, some colored sprites to draw the game board with, and a GUI that could draw a game board and a single random piece every time it was opened as well as print to stdout any key press events that I made.

Oh but by the way…

I have a small confession to make: I didn’t actually start where I left off. I mentioned the stopping point (the “Draw the board with a random piece on it” commit), then I mentioned the next thing I did which was implement basic movement (the “Basic (unsafe) movement” commit), then I completely blew past it and never explained the way I implemented unsafe movement in the first place. The trouble with having skipped that is that I had intended to discuss how keystroke capture actually works, where it is in the GUI code, and follow the event through the system so that people could get that idea into their heads sooner rather than later, because it is so basic to making a game!

So instead I’ll explain how that works and point out where it is in the code here in this post and cover it at the beginning of the third video.

Getting Input

If we look at the “Basic (unsafe) movement” commit there is a file called erltris/src/et_gui.erl that is the code for the GUI process. In the init/1 function we see the wx server get started, a “frame” is created (the main window in wx parlance), various widget elements are established, and on line 112 we see this:

ok = wxPanel:connect(Frame, char_hook),

This is connecting the Frame to a window manager event called char_hook. I do mention this in the video, but it is important to point this out here. I should also point out that I’m mistakenly calling wxPanel:connect/2 instead of wxFrame:connect/2, which is technically incorrect, but due to the nature of the underlying inheritance among the C++ classes that make up wx and the way that’s all masked in the generated library wrappers that make up wxErlang, it actually doesn’t cause any errors. wxPanel and wxFrame are both descendants of wxEvtHandler.

Anyway… what connecting to this event does is tell wx that whenever the frame has the focus it should relay keystroke events to the program. Inside Erlang they arrive as messages that carry a #wx{} record that includes an event element that carries another record that provides all the relevant information about the event. You can receive these in the handle_event/2 callback function of wx_object. Pretty nice. This means you can deal with GUI events in very much the same way you can handle network events as well as inter-process messages within the Erlang runtime: everything is a message.
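
To show the nesting of those records on its own, here is a minimal sketch that pulls the key code out of a wx event message without starting a GUI. It assumes the wx application is installed so that its header file can be included; the module name wx_event_sketch and both function names are made up for illustration.

```erlang
-module(wx_event_sketch).
-export([key_code/1, fake_key_event/1]).

%% The record definitions (#wx{}, #wxKey{}, ...) ship with the wx application.
-include_lib("wx/include/wx.hrl").

%% Pull the key code out of a wx event message, or report that the
%% message is something else.
key_code(#wx{event = #wxKey{keyCode = Code}}) ->
    {key, Code};
key_code(#wx{}) ->
    other_wx_event;
key_code(_) ->
    not_wx.

%% Build a stand-in key event for experimenting in the shell.
%% Fields that are not set here keep their default values.
fake_key_event(Code) ->
    #wx{event = #wxKey{keyCode = Code}}.
```

Calling wx_event_sketch:key_code(wx_event_sketch:fake_key_event(32)) gives {key, 32}; a real event delivered to handle_event/2 would be matched the same way.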

IDIOMS IDIOMS IDIOMS!

As I often say: Develop idioms.

A common idiom I use in wxErlang code is to match on an event type in handle_event/2, assign any relevant event data to variables, and then within a case inside the handle_event clause that matched that event type determine where we want to dispatch the event (if at all). You can see a very clear example of this on lines 191-203 of this commit.

handle_event(#wx{event = #wxKey{keyCode = Code}}, State = #s{frame = Frame}) ->
    ok =
        case Code of
             32 -> et_con:random_piece();   % space
             88 -> et_con:rotate(l);        % X
             90 -> et_con:rotate(r);        % Z
            314 -> et_con:move(l);          % left arrow
            315 -> et_con:rotate(r);        % up arrow
            316 -> et_con:move(r);          % right arrow
            317 -> et_con:move(d);          % down arrow
              _ -> tell("KeyCode: ~p", [Code])
        end,
    {noreply, State};

I could have made each one of those dispatch decisions inside the function head instead of using a case statement within a clause, but I find it much easier to read this particular style where we open a very compact dispatch span per event type we’re looking for than have a ton of handle_event clauses. In complex GUI applications you can wind up with a lot of special keystroke events and it becomes super cumbersome to match them all as function heads. Not just writing the function heads, but trying to find a specific event gets kind of messy because the code starts looking so scrambled.
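
One variation on this idiom, offered only as a sketch and not something the project does: keep the key map in a small pure function, so the mapping can be checked without a running GUI, and let the handle_event clause do nothing but dispatch. The key codes are copied from the clause above; the module name keymap_sketch and the command terms it returns are made up for illustration.

```erlang
-module(keymap_sketch).
-export([command/1]).

%% Map a wx key code to a game command, mirroring the case expression
%% in handle_event/2. Unmapped keys are ignored.
command(32)  -> random_piece;      % space
command(88)  -> {rotate, l};       % X
command(90)  -> {rotate, r};       % Z
command(314) -> {move, l};         % left arrow
command(315) -> {rotate, r};       % up arrow
command(316) -> {move, r};         % right arrow
command(317) -> {move, d};         % down arrow
command(_)   -> ignore.
```

The function heads stay readable here because the function has a single job; the objection above is to spreading many unrelated events across handle_event clauses.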

Each of the different key codes matched in the dispatching part of that function corresponds to some gameplay event. Note that the GUI doesn’t really care what the status of the game is at all; it just cares that it is relaying the mapped commands to the game control process, and then it carries on doing GUI stuff.

This is important to point out: Nearly all communication between the controller and the GUI is asynchronous. You almost never want blocking calls to pass between them. There may be a need for a blocking call in some special code, but this is almost always a bad idea.
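
The difference between the two styles can be sketched with a tiny gen_server. This is not the game's controller; the module async_sketch and its functions are hypothetical, and it is only assumed here that a controller like et_con could be built along these lines. move/2 uses a cast and returns at once, while position/1 uses a call and blocks until the server answers.

```erlang
-module(async_sketch).
-behaviour(gen_server).
-export([start_link/0, move/2, position/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% The server state is a single integer position, starting at 0.
start_link() ->
    gen_server:start_link(?MODULE, 0, []).

%% Asynchronous: returns ok immediately, the caller never waits.
move(Pid, Delta) ->
    gen_server:cast(Pid, {move, Delta}).

%% Synchronous: blocks the caller until the server replies.
%% Included only for contrast.
position(Pid) ->
    gen_server:call(Pid, position).

init(Start) ->
    {ok, Start}.

handle_call(position, _From, X) ->
    {reply, X, X}.

handle_cast({move, Delta}, X) ->
    {noreply, X + Delta}.
```

A GUI process that only ever calls move/2 can never be stalled by a busy controller, which is the property the paragraph above is after.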

Buh bye!

That pretty much sums up what I wanted to cover that I forgot to mention in the video. Input handling is really important. If you want to jump ahead, check out the way the handle_event/2 function evolved in the latest commits to see how menu commands are intercepted, or go check out the handle_event/2 function in the Erlang Minesweeper clone!

Don’t forget to give me all the delicious likes and stars and channel subs! BWAHAHA! I’ll catch you magnificent nerds in the next video!

Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.