The complete guide to Instant Messaging and in-application chat.

What you need to know about Instant Messaging and chat applications

Have you got the message? Chat is a critical feature for almost every business, in virtually every industry. Now, more than ever, digital communication is relied upon to share information and keep our contacts and users in touch. We’ve created bespoke chat applications for use cases as varied as large-scale medical or digital health providers, industry-leading financial services providers and modern dating apps. For business-to-consumer uses, chat is a great way to turn your app or business into a community, keeping users engaged and adding a social element to your applications. In the B2B space, on the other hand, chat applications can be used to increase collaboration and productivity. In fact, external research conducted by one of our clients, TeleWare, found that instant messaging was the most in-demand feature for a financial services app.

In this blog, we’ll look at some of the key considerations for an Instant Messaging service as well as the must-have features of the modern chat application and how MongooseIM 4.0 stacks up to deliver what you need.

Build vs buy

One of the first decisions a company needs to make when implementing a chat offering is whether to use an out-of-the-box product-as-a-service or software-as-a-service offering, or to build its own chat. Below we weigh up the pros and cons of each approach.


load testing diagram

Benefits of buying

The key benefit of an out-of-the-box solution is that you are able to deploy quickly. The bigger players in this space often offer a comprehensive set of integrations and require little to no development from your team. They also provide users with a familiar user interface, which means they’re incredibly quick for anyone to learn. All of this means you can be up and running quickly with the peace of mind that you’re using a tried and tested solution.

Cons of buying

Both product-as-a-service and software-as-a-service options create, by their very nature, the ongoing overhead of a subscription fee. Over time, this cost inevitably adds up, making it a more expensive option. Another drawback is that bought options are designed as one-size-fits-all products and seldom offer flexibility for bespoke features and changes. These options offer next to no control, and data ownership is often shared. This makes it hard for your users to control their privacy and hard for your chat solution to meet any needs beyond the most vanilla offering. The customer service and support can also be variable. All of this creates a huge potential for complications if something stops functioning in what is essentially a black-box solution.

Benefits of building

Building provides you with the flexibility to create a chat solution specific to your needs and to own every step of the functionality. In theory, building can be more affordable over the long term, as it avoids the ongoing costs of a software-as-a-service offering. An owned solution also minimises the risk of major changes to your chat application no longer being compatible with the rest of your application.

Cons of building

When building goes wrong, it is the most costly option, with high upfront and ongoing maintenance costs. Building your own chat application can run into difficulties when the app starts to scale (which is exactly when you can least afford them). Lastly, building something bespoke means there is no support or community to help you troubleshoot.

The MongooseIM way

MongooseIM is a massively scalable, battle-tested, open-source chat server that has been proven to handle tens of millions of connections with ease. With MongooseIM you have the freedom and flexibility to use the open-source product and develop it to your needs, or you can engage our experts to build bespoke features for you. Our team also offers the peace of mind of support should you ever need it. This gives you the freedom and flexibility to develop and own your chat solution without the cost or risk of starting from scratch.


load testing diagram

The most desired features in a chat application

With over a decade’s experience in building chat applications, we know the features required to ensure success, taking everyone from the end user to the DevOps team into consideration. Below is a list of the most used and desired features and how MongooseIM stacks up.

Real-time chat

It goes without saying that a chat application should allow users to reliably send and receive messages in real time. MongooseIM’s scalability ensures that, no matter the spikes or load your user base generates, no important message is lost in transit.

Push notifications

Push notifications are one of the most valuable parts of a modern chat application. Even if your user is not logged into the application, they’ll still be informed of the message. For B2C applications, that increases the chances of bringing them back to your app; for B2B applications, it ensures no important message is missed, without requiring your team to be logged into a chat application at all times. MongooseIM has an in-house developed push notification management system, MongoosePush, designed to integrate with MongooseIM and easily enable push notifications for your chat app.

External integrations

MongooseIM rarely works alone; usually it is coupled with other microservices. We offer a REST API that these services can call, and an event pusher that lets MongooseIM notify them, providing two-way communication with other microservices.

API

An easy-to-use API makes chat faster and easier to embed and integrate into your application. We offer a REST API, which is simple, modern and easily understood by most developers. It can be used for both backend integration and client/service development.

Multi-user Chat

Group chat is one of the most popular features in social settings, and one of the most in-demand features for business collaboration. MongooseIM offers multi-user chat functionality that is reliable and seamless for users whilst minimising demands on your server. We also provide a lightweight implementation of multi-user chat, tailored for mobile devices.

File Transfer and sharing

For the majority of use cases, allowing users to share and transfer files makes a chat more usable, keeping them engaged on your platform for longer. MongooseIM uses an out-of-band transfer method, which reduces the workload on the server side whilst still giving users an easy way to share files within the chat application.

Batch permission

Batch permissions allow for privacy and control of access to information. MongooseIM uses access control lists to offer this functionality. Our chat applications have been approved by regulatory bodies in healthcare and financial services worldwide.

Contact management

As an application built on XMPP, MongooseIM uses the tried and tested mod_roster functionality to allow users to manage and customise their address books within the chat application.

History and version control

If something goes wrong, history and version control is vital. Having access to previous versions means you always have a proven version to fall back on. MongooseIM has a public history of its source code which you have access to at all times.

Contact sharing

Contact sharing from within a chat application encourages connections between groups of users, helps to grow user bases and increase collaboration.

Four key MongooseIM integrations

Instant Messaging and Kubernetes

Kubernetes has become an extremely popular platform-agnostic deployment tool and has powerful cloud management automation. The MongooseIM Helm Chart makes it easy to install MongooseIM and MongoosePush to Kubernetes.

Structured Log Management for chat solutions

Humio is a modern log management tool that provides complete observability to your team. Our new structured logging allows you to integrate with log management tools such as Humio to identify, prevent and resolve bottlenecks, poor usage patterns in production and other recurring issues in your system.

Instant Messaging metrics

WombatOAM is another tool to help you understand what is going on under-the-hood in your system. WombatOAM specialises in giving you visibility on the metrics of your system so you can identify trends and prevent problems arising. This includes allowing you to create automated alarms based on customisable performance metrics such as CPU usage.

Asynchronous message delivery

In complex systems, RabbitMQ can be used as an asynchronous message broker. MongooseIM handles the instant messaging between users’ smartphones, while RabbitMQ connects these devices to other software systems.

Make sure your users get the message

MongooseIM 4.0 has just been released. In this release, we’ve gone a step further to ensure an easy-to-use product for developers, users and DevOps teams alike. Explore the changes on GitHub.

If you need help with the perfect chat solution for your needs, talk to our team of scalability experts. We’re always happy to help.

You may also like:

MongooseIM - page

How to add messaging - webinar

Testing the scalability of MongooseIM - blog

How we can add value to your product with chat - blog


You Reap What You Code

2020/10/20


This is a loose transcript of my talk at Deserted Island DevOps Summer Send-Off, an online conference in COVID-19 times. One really special thing about it is that the whole conference takes place over the Animal Crossing video game, with quite an interesting setup.

It was the last such session of the season, and I was invited to present with few demands. I decided to make a compressed version of a talk I had been mulling over for close to a year, which had been lined up for at least one in-person conference that got cancelled/postponed in April, and which I had given in its full hour-long length internally at work. The final result is a condensed 30 minutes that touches all kinds of topics, some of which have been borrowed from previous talks and blog posts of mine.

If I really wanted to, I could probably make one shorter blog post out of every one or two slides in there, but I decided to go for coverage rather than depth. Here goes nothing.

'You Reap What You Code': shows my character in-game sitting at a computer with a bunch of broken parts around, dug from holes in the ground

So today I wanted to give a talk on this tendency we have as software developers and engineers to write code and deploy things that end up being a huge pain to live with, to an extent we hadn't planned for.

In software, a pleasant surprise is writing for an hour without compiling once and then it works; a nasty surprise is software that seems to work and after 6 months you find out it poisoned your life.

This presentation is going to be a high-level thing, and I want to warn you that I'm going to go through some philosophical concerns at first, follow that up with research that has taken place in human factors and cognitive science, and tie that up with broad advice that I think could be useful to everyone when it comes to systems thinking and designing things. A lot of this may feel a bit out there, but I hope that by the end it'll feel useful to you.

'Energy and Equity; Ivan Illich' shows a screenshot of the game with a little village-style view

This is the really philosophical stuff we're starting with. Ivan Illich was a wild ass philosopher who hated things like modern medicine and mandatory education. He wrote this essay called "Energy and Equity" (to which I was introduced by reading a Stephen Krell presentation), where he decides to also dislike all sorts of motorized transportation.

Illich introduces the concept of an "oppressive" monopoly; if we look at societies that developed for foot traffic and cycling, you can generally use any means of transportation whatsoever and effectively manage to live and thrive there. Whether you live in a tent or a mansion, you can get around the same.

He pointed out that cycling was innately fair because it does not require more energy than what is required as a baseline to operate: if you can walk, you can cycle, and cycling, for the same energy as walking, is incredibly more efficient. Cars don't have that; they are rather expensive, and require disproportionate amounts of energy compared to what a basic person has.

His suggestion was that all non-freight transport, whether cars or buses and trains, be capped to a fixed percentage above the average speed of a cyclist, which is based on the power a normal human body can produce on its own. He suggested we do this to prevent...

Aerial stock photo of an American suburb

that!

We easily conceived of cars as ways to make existing burdens lighter: they created freedoms and widened our access to goods and people. The car was a better horse, and a less exhausting bicycle. And so society developed to embrace cars in its infrastructure.

Rather than having a merchant bring goods to the town square, the milkman drop milk on the porch, and markets smaller and distributed closer to where they'd be convenient, it is now everyone's job to drive for each of these things while stores go to where land is cheap rather than where people are. And when society develops with a car in mind, you now need a car to be functional.

In short, the cost of participating in society has gone up, and that's what an oppressive monopoly is.

'The Software Society': Van Bentum's painting The Explosion in the Alchemist's Laboratory

To me, the key thing that Illich did was twist the question another way: what effects would cars have on society if a majority of people had them, and what effect would it have on the rest of us?

The question I now want to ask is whether we have the equivalent in the software world. What are the things we do that we perceive increase our ability to do things, but turn out to actually end up costing us a lot more to just participate?

We kind of see it with our ability to use all the bandwidth a user may have; trying to use old dial-up connections is flat out unworkable these days. But do we have the same with our cognitive cost? The tooling, the documentation, the procedures?

'Ecosystems; we share a feedback loop': a picture of an in-game aquarium within the game's museum

I don't have a clear answer to any of this, but it's a question I ask myself a lot when designing tools and software.

The key point is that the software and practices that we choose to use is not just something we do in a vacuum, but part of an ecosystem; whatever we add to it changes and shifts expectations in ways that are out of our control, and impacts us back again. The software isn't trapped with us, we're trapped with the software.

Are we not ultimately just making our life worse for it? I want to focus on this part where we make our own life, as developers, worse. When we write or adopt software to help ourselves but end up harming ourselves in the process, because that speaks to our own sustainability.

'Ironies of automation; (Bainbridge, 1983): A still from Fantasia's broom scene

Now we're entering the cognitive science and human factors bit.

Rather than just being philosophical here I want to ground things in the real world with practical effects. Because this is something that researchers have covered. The Ironies of automation are part of cognitive research (Bainbridge, 1983) that looked into people automating tasks and finding out that the effects weren't as good as expected.

Mainly, it's attention and practice clashing. There are tons of examples over the years, but let's take a look at a modern one with self-driving cars.

Self-driving cars are a fantastic case of clumsy automation. What most established players in the car industry are doing is lane tracking, blind spot detection, and handling parallel parking.

But high-tech companies (Tesla, Waymo, Uber) are working towards full self-driving, with Tesla's Autopilot being the most ambitious one released to the public at large. All of these currently operate in ways Bainbridge fully predicted in 1983:

  • the driver is no longer actively involved and is shifted to the role of monitoring
  • the driver, despite no longer driving the car, must nevertheless remain fully aware of everything the car is doing
  • when the car gets in a weird situation, it is expected that the driver takes control again
  • so the car handles all the easy cases, but all the hard cases are left to the driver

Part of the risk there is twofold: people have limited attention for tasks they are not involved in—if you're not actively driving it's going to be hard to be attentive for extended periods of time—and if you're only driving rarely with only the worst cases, you risk being out of practice to handle the worst cases.

Similar automation is used by airlines, who otherwise make up for it with simulator hours and by still manually handling planned difficult phases like takeoff and landing. Even so, a number of airline incidents show that this hand-off is often complex and does not go well.

Clearly, when we ignore the human component and its responsibilities in things, we might make software worse than what it would have been.

'HABA-MABA problems': a chart illustrating Fitt's model using in-game images

In general, most of these errors come from the following point of view. This is called the "Fitts" model, also "HABA-MABA", for "Humans are better at, machines are better at" (the original version was referred to as MABA-MABA, using "Men" rather than "Humans"). This model frames humans as slow, perceptive beings capable of judgement, and machines as fast, undiscerning, indefatigable things.

We hear this a whole lot even today. These things are, to be polite, a beginner's approach to automation design. It's based on scientifically outdated concepts, intuitive-but-wrong sentiments, and is comforting in letting you think that only the predicted results will happen and totally ignores any emergent behaviour. It operates on what we think we see now, not on stronger underlying principles, and often has strong limitations when it comes to being applied in practice.

It is disconnected from the reality of human-machine interactions, and frames choices as binary when they aren't, usually with the intent of pushing the human out of the equation when you shouldn't. This is, in short, a significant factor behind the ironies of automation.

'Joint Cognitive Systems': a chart illustrating the re-framing of computers as teammates

Here's a patched version established by cognitive experts. They instead reframe the human-computer relationship as a "joint cognitive system", meaning that instead of thinking of humans and machines as unrelated things that must be used in distinct contexts for specific tasks, we should frame humans and computers as teammates working together. This, in a nutshell, shifts the discourse from how one is limited to terms of how one can complement the other.

Teammates do things like being predictable to each other, sharing a context and language, being able to notice when their actions may impact others and adjust accordingly, communicate to establish common ground, and have an idea of everyone's personal and shared objectives to be able to help or prioritize properly.

Of course we must acknowledge that we're nowhere close to computers being teammates with the state of the art today. And since computers currently need us to keep realigning them all the time, we have to admit that the system is not just the code and the computers; it's the code, the computers, and all the people who interact with them and each other. If we want our software to help us, we need to be able to help it, and that means building the software knowing it will be full of limitations, and working to make it easier to diagnose issues and to form and improve mental models.

So the question is: what makes a good model? How can we help people work with what we create?

'How People Form Models': a detailed road map of the city of London, UK

note: this slide and the next one are taken from my talk on operable software

This is a map of the city of London, UK. It is not the city of London, just a representation of it. It's very accurate: it has streets with their names, traffic directions, building names, rivers, train stations, metro stations, footbridges, piers, parks, gives details regarding scale, distance, and so on. But it is not the city of London itself: it does not show traffic nor roadwork, it does not show people living there, and it won't tell you where the good restaurants are. It is a limited model, and probably an outdated one.

But even if it's really limited, it is very detailed. Detailed enough that pretty much nobody out there can fit it all in their head. Most people will have some detailed knowledge of some parts of it, like the zoomed-in square in the image, but pretty much nobody will just know the whole of it in all dimensions.

In short, pretty much everyone in your system only works from partial, incomplete, and often inaccurate and outdated data, which itself is only an abstract representation of what goes on in the system. In fact, what we work with might be more similar to this:

A cartoony tourist map of London's main attractions

That's more like it. This is still not the city of London, but this tourist map of London is closer to what we work with. Take a look at your architecture diagrams (if you have them), and chances are they look more like this map than the very detailed map of London. This map has most stuff a tourist would want to look at: important buildings, main arteries to get there, and some path that suggests how to navigate them. The map has no well-defined scale, and I'm pretty sure that the two giant people on Borough road won't fit inside Big Ben. There are also lots of undefined areas, but you will probably supplement them with other sources.

But that's alright, because mental models are as good as their predictive power; if they let you make a decision or accomplish a task correctly, they're useful. And our minds are kind of clever in that they only build models as complex as they need to be. If I'm a tourist looking for my way between main attractions, this map is probably far more useful than the other one.

There's a fun saying about this: "Something does not exist until it is broken." Subjectively, you can be entirely content operating a system for a long time without ever knowing about entire aspects of it. It's when they start breaking, or when your predictions about the system no longer work, that you have to go back and re-tune your mental models. And since this is all very subjective, everyone has different models.

This is a vague answer to what is a good model, and the follow up is how can we create and maintain them?

'Syncing Models': a still from the video game in the feature where you back up your island by uploading it online

One simple step, outside of all technical components, is to challenge and help each other to sync and build better mental models. We can't easily transfer our own models to each other, and in fact it's pretty much impossible to control them. What we can do is challenge them to make sure they haven't eroded too much, and try things to make sure they're still accurate, because things change with time.

So in a corporation, things like training, documentation, and incident investigations all help surface aspects of, and changes to, our systems to everyone. Game days and chaos engineering are also excellent ways to discover how our models might be broken in a controlled setting.

They're definitely things we should do and care about, particularly at an organisational level. That being said, I want to focus a bit more on the technical stuff we can do as individuals.

'Layering Observability': a drawing of abstraction layers and observation probes' locations

note: this slide is explored more in depth in my talk on operable software

We can't just open a so-called glass pane and see everything at once. That's too much noise, too much information, too little structure. Seeing everything is only useful to the person who knows what to filter in and filter out. You can't easily form a mental model of everything at once. To aid model formation, we should structure observability to tell a story.

Most applications and components you use that are easy to operate do not expose their internals to you, they mainly aim to provide visibility into your interactions with them. There has to be a connection between the things that the users are doing and the impact it has in or on the system, and you will want to establish that. This means:

  • Provide visibility into interactions between components, not their internals
  • Log at the layer below the one you want to debug; this saves time and reduces how many observability probes you need to insert in your code base. We have a tendency to stick everything at the app level, but that's misguided.
  • This means the logs around a given endpoint have to be about the user interactions with that endpoint, and require no knowledge of its implementation details
  • For developer logs, you can have one log statement shared by all the controllers by inserting it a layer below the endpoints within the framework, rather than having to insert one for each endpoint (see the sketch after this section).
  • These interactions will let people make a mental picture of what should be going on and spot where expectations are broken more easily. By layering views, you then make it possible to skip between layers according to which expectations are broken and how much knowledge they have
  • Where a layer provides no easy observability, people must cope through inferences in the layers above and below it. It becomes a sort of obstacle.

Often we are stuck with only observability at the highest level (the app) or the lowest level (the operating system), with nearly nothing useful in-between. We have a blackbox sandwich where we can only look at some parts, and that can be a consequence of the tools we choose. You'll want to actually pick runtimes and languages and frameworks and infra that let you tell that observability story and properly layer it.
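
As a small, concrete sketch of the "one log statement a layer below the endpoints" idea from the list above (assuming an Elixir/Phoenix application, with the module and field names made up for illustration), a single plug in the endpoint can describe every user interaction in terms of the request and response, without knowing anything about individual controllers:

defmodule MyAppWeb.Plugs.InteractionLogger do
  # One probe below all the endpoints: every controller is covered
  # without inserting a log statement into each of them.
  @behaviour Plug
  require Logger

  def init(opts), do: opts

  def call(conn, _opts) do
    start = System.monotonic_time()

    Plug.Conn.register_before_send(conn, fn conn ->
      duration_ms =
        System.convert_time_unit(System.monotonic_time() - start, :native, :millisecond)

      # Log facts about the interaction, not its implementation details.
      Logger.info("request handled",
        method: conn.method,
        path: conn.request_path,
        status: conn.status,
        duration_ms: duration_ms
      )

      conn
    end)
  end
end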

'Logging Practices': a game character chopping down trees

Another thing that helps with model formation is keeping the relationship between humans and machines running smoothly. This is a trust relationship, and providing information that is considered misleading or unhelpful erodes that trust. There are a few things you can do with logs that can help not ruin your marriage to the computer.

The main one is to log facts, not interpretations. You often do not have all the context from within a single log line, just a tiny part of it. If you start trying to be helpful and suggesting things to people, you change what is a fact-gathering expedition into a murder-mystery investigation where bits of the system can't be trusted or you have to read between the lines. That's not helpful. A log line that says TLS validation error: SEC_ERROR_UNKNOWN_ISSUER is much better than one that says ERROR: you are being hacked, regardless of how much experience you have.

A thing that helps with that is structured logging, which is better than regular text. It makes it easier for people to use scripts or programs to parse, aggregate, route, and transform logs. It prevents you from needing full-text search to figure out what happened. If you really want to provide human readable text or interpretations, add it to a field within structured logging.
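
To make that concrete, here is a sketch assuming Elixir's Logger (any structured logging setup works the same way; the field names and values are only illustrative). The facts go into fields, and any human-readable text is just one more field rather than something scripts have to parse:

require Logger

# Facts as structured fields; no interpretation of what they mean.
Logger.error("TLS validation error",
  error: :tls_validation_error,
  reason: :sec_error_unknown_issuer,
  peer: "example.com",
  port: 443
)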

Finally, adopting consistent naming mechanisms and units is always going to prove useful.

'Hitting Limits': the game's museum's owl being surprised while woken up

There is another thing called the Law of Requisite Variety, which says that only complexity can control complexity. If an agent can't represent all the possible states and circumstances around a thing it tries to control, it won't be able to control it all. Think of an airplane's flight stabilizers; they're able to cope only with a limited amount of adjustment, and usually at a higher rate than we humans could. Unfortunately, once it reaches a certain limit in its actions and things it can perceive, it stops working well.

That's when control is either ineffective, or passed on to the next best things. In the case of software we run and operate, that's us, we're the next best thing. And here we fall into the old idea that if you are as clever as you can to write something, you're in trouble because you need to be doubly as clever to debug it.

That's because to debug a system that is misbehaving under automation, you need to understand the system, and then understand the automation, then understand what the automation thinks of the system, and then take action.

That's always kind of problematic, but essentially, brittle automation forces you to know more than if you had no automation in order to make things work in difficult times. Things can then become worse than if you had no automation in the first place.

'Handle Hand-Offs First': this in-game owl/museum curator accepting a bug he despises for his collection

When you start creating a solution, do it while being aware that it is possibly going to be brittle and will require handing control over to a human being. Focus on the path where the automation fails and how the hand-off will take place. How are you going to communicate that, and which clues or actions will an operator have to take over things?

When we accept and assume that automation will reach its limits, and the thing that it does is ask a human for help, we shift our approach to automation. Make that hand-off path work easily. Make it friendly, and make it possible for the human to understand what the state of automation was at a given point in time so you can figure out what it was doing and how to work around it. Make it possible to guide the automation into doing the right thing.

Once you've found your way around that, you can then progressively automate things, grow the solution, and stay in line with these requirements. It's a backstop for bad experiences, similar to "let it crash" for your code, so doing it well is key.

'Curb Cut Effect': a sidewalk with the classic curb cut in it

Another thing that I think is interesting is the curb cut effect. The curb cut effect was noticed as a result of the various American laws about accessibility that started in the 60s. The idea is that to make sidewalks and streets accessible to people in wheelchairs, you would cut the part of the curb so that it would create a ramp from sidewalk to street.

The thing that people noticed is that even though you'd cut the curb for handicapped people, getting around was now easier for people carrying luggage, pushing strollers, on skateboards or bicycles, and so on. Some studies saw that people without handicaps would even deviate from their course to use the curb cuts.

Similar effects are found when you think of something like subtitles which were put in place for people with hearing problems. When you look at the raw number of users today, there are probably more students using them to learn a second or third language than people using them with actual hearing disabilities. Automatic doors that open when you step in front of them are also very useful for people carrying loads of any kind, and are a common example of doing accessibility without "dumbing things down."

I'm mentioning all of this because I think that keeping accessibility in mind when building things is one of the ways we can turn nasty negative surprises into pleasant emerging behaviour. And generally, accessibility is easier to build in than to retrofit. In the case of the web, accessibility also lines up with better performance.

If you think about diversity in broader terms, how would you rethink your dashboards and monitoring and on-call experience if you were to run it 100% on a smartphone? What would that let people on regular computers do that they cannot today? Ask the same question but with user bases that have drastically different levels of expertise.

I worked with an engineer who used to work in a power station and the thing they had set up was that during the night, when they were running a short shift, they'd generate an audio file that contained all the monitoring metrics. They turned it into a sort of song, and engineers coming in in the morning would listen to it on fast forward to look for anomalies.

Looking at these things can be useful. If you prepare for your dashboard users to be colorblind, would customizing colors be useful? And could that open up new regular use cases to annotate metrics that tend to look weird and that you want to keep an eye on?

And so software shouldn't be about doing more with less. It's actually requiring less to do more. As in letting other people do more with less.

'Complexity Has To Live Somewhere': in-game's 'The Thinker' sitting at a desk, looking like it's pondering at papers

note: this slide is a short version of my post on Complexity Has to Live Somewhere

A thing we try to do, especially as software engineers, is to try to keep the code and the system—the technical part of the system—as simple as possible. We tend to do that by finding underlying concepts, creating abstractions, and moving things outside of the code. Often that means we rely on some sort of convention.

When that happens, what really goes on is that the complexity of how you chose to solve a problem still lingers around. Someone has to handle the thing. If you don't, your users have to do it. And if it's not in the code, it's in your operators or the people understanding the code. Because if the code is to remain simple, the difficult concepts you abstracted away still need to be understood and present in the world that surrounds the code.

I find it important to keep that in mind. There's this kind of fixed amount of complexity that moves around the organization, both in code and in the knowledge your people have.

Think of how people interact with the features day to day. What do they do, how does it impact them? What about the network of people around them? How do they react to that? Would you approach software differently if you think that it's still going to be around in 5, 10, or 20 years when you and everyone who wrote it has left? If so, would that approach help people who join in just a few months?

One of the things I like to think about is that instead of using military analogies of fights and battles, it's interesting to frame it in terms of gardens or agriculture. When we frame the discussion that we have in terms of an ecosystem and the people working collectively within it, the way we approach solving problems can also change drastically.

'Replacing, Adding, or Diffusing?': the trolley problem re-enacted with in-game items

Finally, one of the things I want to mention briefly is this little thought framework I like when we're adopting new technology.

When we first adopt a new piece of technology, the thing we try to do—or tend to do—is to start with the easy systems first. Then we say "oh that's great! That's going to replace everything we have." Eventually, we try to migrate everything, but it doesn't always work.

So an approach that makes sense is to start with the easy stuff to prove that it's workable for the basic cases. But also try something really, really hard, because that would be the endpoint. The endgame is to migrate the hardest thing that you've got.

If you're not able to replace everything, consider framing things as adding it to your system rather than replacing. It's something you add to your stack. This framing is going to change the approach you have in terms of teaching, maintenance, and in terms of pretty much everything that you have to care about so you avoid the common trap of deprecating a piece of critical technology with nothing to replace it. If you can replace a piece of technology then do it, but if you can't, don't fool yourself. Assume the cost of keeping things going.

The third one there is diffusing. I think diffusing is something we do implicitly when we do DevOps. We took the Ops responsibilities and the Dev responsibilities, and instead of having them in different areas with small groups of experts in development and operations, we end up making it everybody's responsibility to be aware of all aspects.

That creates that diffusion where in this case, it can be positive. You want everyone to be handling a task. But if you look at the way some organisations are handling containerization, it can be a bunch of operations people who no longer have to care about that aspect of their job. Then all of the development teams now have to know and understand how containers work, how to deploy them, and just adapt their workflow accordingly.

In such a case we haven't necessarily replaced or removed any of the needs for deployment. We've just taken it outside of the bottleneck and diffused it and sent it to everyone else.

I think having an easy way, early in the process, to figure out whether what we're doing is replacing, adding, or diffusing things will drastically influence how we approach change at an organisational level. I think it can be helpful.

'Thanks': title slide again

This is all I have for today. Hopefully it was practical.

Thanks!


A brief introduction to BEAM

This post is a brief primer on BEAM, the virtual machine that executes user code in the Erlang Runtime System (ERTS). It’s intended to help those new to BEAM follow an upcoming series of posts about the JIT in OTP 24, leaving implementation details for later.

BEAM is often confused with ERTS and it’s important to distinguish between the two; BEAM is just the virtual machine and it has no notion of processes, ports, ETS tables, and so on. It merely executes instructions and while ERTS has influenced their design, it doesn’t affect what they do when the code is running, so you don’t need to understand ERTS to understand BEAM.

BEAM is a register machine, where all instructions operate on named registers. Each register can contain any Erlang term such as an integer or a tuple, and it helps to think of them as simple variables. The two most important kinds of registers are:

  • X: these are used for temporary data and passing data between functions. They don’t require a stack frame and can be freely used in any function, but there are certain limitations which we’ll expand on later.
  • Y: these are local to each stack frame and have no special limitations beyond needing a stack frame.

Control flow is handled by instructions that test a certain condition and either move on to the next instruction or branch to their fail label, denoted by {f,Index}. For example, {test,is_integer,{f,7},[{x,0}]} checks whether {x,0} contains an integer and jumps to label 7 if it doesn’t.

Function arguments are passed from left to right in X registers, starting at {x,0}, and the result is returned in {x,0}.

It’s easier to explain how this fits together through example, so let’s walk through a few:

sum_tail(List) ->
    sum_tail(List, 0).

sum_tail([Head | Tail], Acc) ->
    sum_tail(Tail, Head + Acc);
sum_tail([], Acc) ->
    Acc.

Let’s use erlc -S to look at the instructions one by one:

%% sum_tail/1, entry label is 2
{function, sum_tail, 1, 2}.

  %% Marks a jump target with the label 1.
  {label,1}.

    %% Special instruction that raises a function_clause
    %% exception. Unused in this function.
    {func_info,{atom,primer},{atom,sum_tail},1}.

  {label,2}.
    %% The meat of the function starts here.
    %%
    %% Our only argument - List - is in {x,0} and
    %% since sum_tail/2 expects it to be the first
    %% argument we can leave it be. We'll pass the
    %% integer 0 as the second argument in {x,1}.
    {move,{integer,0},{x,1}}.

    %% Tail call sum_tail/2, whose entry label is 4.
    {call_only,2,{f,4}}.

%% sum_tail/2, entry label is 4
{function, sum_tail, 2, 4}.
  {label,3}.
    {func_info,{atom,primer},{atom,sum_tail},2}.
  {label,4}.

    %% Test whether we have a non-empty list, and jump to
    %% the base case at label 5 if we don't.
    {test,is_nonempty_list,{f,5},[{x,0}]}.

    %% Unpack the list in the first argument, placing the
    %% head in {x,2} and the tail in {x,0}.
    {get_list,{x,0},{x,2},{x,0}}.

    %% Add the head and our accumulator (remember that the
    %% second function argument is in {x,1}), and place
    %% the result in {x,1}.
    %%
    %% A fail label of 0 means that we want the
    %% instruction to throw an exception on error, rather
    %% than jump to a given label.
    {gc_bif,'+',{f,0},3,[{x,2},{x,1}],{x,1}}.

    %% Tail-call ourselves to handle the rest of the list,
    %% the arguments are already in the right registers.
    {call_only,2,{f,4}}.

  {label,5}.
    %% Test whether our argument was the empty list. If
    %% not, we jump to label 3 to raise a function_clause
    %% exception.
    {test,is_nil,{f,3},[{x,0}]}.

    %% Return our accumulator.
    {move,{x,1},{x,0}}.
    return.

Simple enough, isn’t it?

I glossed over one little detail though; the mysterious number 3 in the addition instruction. This number tells us how many X registers hold live data in case we need more memory, so they can be preserved while the rest are discarded as garbage. As a consequence, it’s unsafe to refer to higher X registers after this instruction as their contents may be invalid (in this case {x,3} and above).

Function calls are similar; we may schedule ourselves out whenever we call or return from a function, and we’ll only preserve the function arguments/return value when we do so. This means that all X registers except for {x,0} are invalid after a call even if you knew for certain that the called function didn’t touch a certain register.

This is where Y registers enter the picture. Let’s take the previous example and make it body-recursive instead:

sum_body([Head | Tail]) ->
    Head + sum_body(Tail);
sum_body([]) ->
    0.


{function, sum_body, 1, 7}.
  {label,6}.
    {func_info,{atom,primer},{atom,sum_body},1}.
  {label,7}.
    {test,is_nonempty_list,{f,8},[{x,0}]}.

    %% Allocate a stack frame with a single Y register.
    %% Since this instruction may need more memory, we
    %% tell the garbage collector that we currently have
    %% one live X register (our list argument in {x,0}).
    {allocate,1,1}.

    %% Unpack the list, placing the head in {y,0} and
    %% the tail in {x,0}.
    {get_list,{x,0},{y,0},{x,0}}.

    %% Body-call ourselves. Note that while this kills all
    %% X registers, it leaves Y registers alone so our
    %% head is still valid.
    {call,1,{f,7}}.

    %% Add the head to our return value and store the
    %% result in {x,0}.
    {gc_bif,'+',{f,0},1,[{y,0},{x,0}],{x,0}}.

    %% Deallocate our stack frame and return.
    {deallocate,1}.
    return.

  {label,8}.
    {test,is_nil,{f,6},[{x,0}]}.

    %% Return the integer 0.
    {move,{integer,0},{x,0}}.
    return.

Notice how the call instruction changed now that we’re in a stack frame? There are three different call instructions:

  • call: ordinary call as in the example. Control flow will resume at the next instruction when the called function returns.
  • call_last: tail call when there is a stack frame. The current frame will be deallocated before the call.
  • call_only: tail call when there is no stack frame.

Each of these has a variant for calling functions in other modules (e.g. call_ext), but they’re otherwise identical.

So far we’ve only looked at using terms, but what about creating them? Let’s have a look:

create_tuple(Term) ->
    {hello, Term}.


{function, create_tuple, 1, 10}.
  {label,9}.
    {func_info,{atom,primer},{atom,create_tuple},1}.
  {label,10}.
    %% Allocate the three words needed for a 2-tuple, with
    %% a liveness annotation of 1 indicating that {x,0}
    %% is alive in case we need to GC.
    {test_heap,3,1}.

    %% Create the tuple and place the result in {x,0}
    {put_tuple2,{x,0},{list,[{atom,hello},{x,0}]}}.
  
    return.

This is a bit magical in the sense that there’s an unseen register for memory allocations, but allocation is rarely far apart from use and it’s usually pretty easy to follow. The same principle applies to lists (consing), floats, and funs as well, following PR 2765.

More complicated types like maps, big integers, references, and so on are created by special instructions that may GC on their own (or allocate outside the heap in a “heap fragment”) as their size can’t be statically determined in advance.

Now let’s look at something more uncommon: exceptions.

exception() ->
    try
        external:call()
    catch
        throw:example -> hello
    end.


{function, exception, 0, 12}.
  {label,11}.
    {func_info,{atom,primer},{atom,exception},0}.
  {label,12}.
    {allocate,1,0}.
  
    %% Place a catch tag in {y,0}. If an exception is
    %% raised while this tag is the most current one,
    %% the control flow will resume at {f,13} in this
    %% stack frame.
    {'try',{y,0},{f,13}}.

    {call_ext,0,{extfunc,external,call,0}}.

    %% Deactivate the catch tag before returning with the
    %% result from the call.
    {try_end,{y,0}}.

    {deallocate,1}.
    return.

  {label,13}.
    %% Uh oh, we've got an exception. Kill the catch tag
    %% and place the exception class in {x,0}, the error
    %% reason/thrown value in {x,1}, and the stack trace
    %% in {x,2}.
    {try_case,{y,0}}.

    %% Return 'hello' if the user threw 'example'
    {test,is_eq_exact,{f,14},[{x,0},{atom,throw}]}.
    {test,is_eq_exact,{f,14},[{x,1},{atom,example}]}.
    {move,{atom,hello},{x,0}}.
    {deallocate,1}.
    return.

  {label,14}.
    %% Otherwise, rethrow the exception since no catch
    %% clause matched.
    {bif,raise,{f,0},[{x,2},{x,1}],{x,0}}.

By now you’ve probably noticed how the control flow only moves forward; just like Erlang itself the only way to loop is through recursion. The one exception to this is the receive construct, which may loop until a matching message has been received:

selective_receive(Ref) ->
    receive
        {Ref, Result} -> Result
    end.


{function, selective_receive, 1, 16}.
  {label,15}.
    {func_info,{atom,primer},{atom,selective_receive},1}.
  {label,16}.
    {allocate,1,1}.

    %% We may be scheduled out while waiting for a
    %% message, so we'll preserve our Ref in {y,0}.
    {move,{x,0},{y,0}}.

  {label,17}.
    %% Pick the next message from the process' message box
    %% and place it in {x,0}, jumping to label 19 if the
    %% message box is empty.
    {loop_rec,{f,19},{x,0}}.
  
    %% Does it match our pattern? If not, jump to label 18
    %% and try the next message.
    {test,is_tuple,{f,18},[{x,0}]}.
    {test,test_arity,{f,18},[{x,0},2]}.
    {get_tuple_element,{x,0},0,{x,1}}.
    {test,is_eq_exact,{f,18},[{x,1},{y,0}]}.

    %% We've got a match, extract the result and remove
    %% the message from the mailbox.
    {get_tuple_element,{x,0},1,{x,0}}.
    remove_message.
    {deallocate,1}.
    return.

  {label,18}.
    %% The message didn't match, loop back to handle our
    %% next message. Note that the current message remains
    %% in the inbox since a different receive may be
    %% interested in it.
    {loop_rec_end,{f,17}}.

  {label,19}.
    %% Wait until the next message arrives, returning to
    %% the start of the loop when it does. If there's a
    %% timeout involved, it will be handled here.
    {wait,{f,17}}.

There’s not much more to it, and if you feel comfortable following the examples above you should have no problems with the JIT series.

If you’re curious about which instructions there are, you can find a brief description of every instruction in genop.tab.


Redirecting routes in a Phoenix application using plugs

There are many reasons why we might want to redirect routes in our Phoenix application, such as the name of a domain concept changing or the removal of a page from the application.

As an example, let’s say we have the following route /home and we want to change it to /welcome:

scope "/", AppWeb do
  pipe_through :browser

  get "/home", HomeController, :index
end

One of the most common solutions — and perhaps the most straightforward — is to create a WelcomeController, rename the view module and template file, and redirect from the old controller:

defmodule AppWeb.HomeController do
  use AppWeb, :controller

  def index(conn, _params) do
    welcome_path = Routes.welcome_path(conn, :index)

    conn |> redirect(to: welcome_path) |> halt()
  end
end

And then we can reference both controllers from our router.ex:

scope "/", AppWeb do
  pipe_through :browser

  get "/home", HomeController, :index
  get "/welcome", WelcomeController, :index
end

Functionally, there is nothing wrong with this approach. It gets the job done, and it’s simple to implement. But there are some significant downsides:

  1. We have to keep both the old controller and the new controller around
  2. If our redirection behavior were even slightly more complex — such as handling a URL parameter — it would feel like the old controller knew too much about the new controller’s implementation
  3. Someone looking at router.ex can’t know whether get "/home", HomeController, :index is rendering a page that exists, or performing a redirection
  4. We can’t test the redirection behavior without writing a controller test

The solution to these problems is to extract the redirection logic and responsibility away from the controller.

Where should that logic live? In a Plug, of course!

Redirecting a single route

The Phoenix.Router.forward/4 macro allows us to forward requests made for a given path to a named plug; for example:

scope "/", AppWeb do
  pipe_through :browser

  get "/welcome", WelcomeController, :index
  forward "/home", Plugs.WelcomePageRedirector
end

Even without knowing the implementation details of Plugs.WelcomePageRedirector, it is still clear from a quick look at the route definition that the /home route is performing a redirect.

If we then define Plugs.WelcomePageRedirector in app/lib/app_web/plugs/welcome_page_redirector.ex, a simple implementation might look like this:

defmodule AppWeb.Plugs.WelcomePageRedirector do
  alias AppWeb.Router.Helpers, as: Routes

  def init(default), do: default

  def call(conn, _opts) do
    welcome_path = Routes.welcome_path(conn, :index)

    conn
    |> Phoenix.Controller.redirect(to: welcome_path)
    |> Plug.Conn.halt()
  end
end

The implementation of the plug is almost identical to how we previously implemented the redirect in the controller. But encapsulating the logic into a plug means that we can easily test the plug in isolation from the controller.
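
Because the plug is just a module with init/1 and call/2, a test can exercise it directly. A minimal sketch might look like the following (it assumes the conventionally generated AppWeb.ConnCase, which imports Phoenix.ConnTest and aliases the router helpers as Routes):

defmodule AppWeb.Plugs.WelcomePageRedirectorTest do
  use AppWeb.ConnCase, async: true

  alias AppWeb.Plugs.WelcomePageRedirector

  test "redirects /home requests to the welcome page", %{conn: conn} do
    conn = WelcomePageRedirector.call(conn, WelcomePageRedirector.init([]))

    # The plug should issue a redirect and halt further processing.
    assert redirected_to(conn) == Routes.welcome_path(conn, :index)
    assert conn.halted
  end
end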

Redirecting multiple routes

If you have multiple routes you want to redirect, a single plug for each route is a viable option. And there is value in keeping the plug implementations straightforward to make them easier to understand and test.

But in situations where you have a set of routes you want to redirect in a predictable way, you can write a more dynamic plug.

As an example, let’s say we have a ProfileController that handles all requests to the route /profile/:id/* and we want to rename it to UserController and move its routes to /user/:id/*.

The forward/4 macro forwards all requests starting with the given path, so the following would handle any route starting with /profile:

scope "/", AppWeb do
  pipe_through :browser

  resources "/user", UserController 
  forward "/profile", Plugs.ProfileRedirector
end

And Plugs.ProfileRedirector has access to the request fields from Plug.Conn, allowing us to dynamically redirect the request in our plug implementation based on the user id path parameter:

defmodule AppWeb.Plugs.ProfileRedirector do
  alias AppWeb.Router.Helpers, as: Routes

  def init(default), do: default

  def call(conn, _opts) do
    [id | _tail] = conn.path_info

    user_path = Routes.user_path(conn, :show, id)

    conn
    |> Phoenix.Controller.redirect(to: user_path)
    |> Plug.Conn.halt()
  end
end

The above implementation does not attempt to make all resource routes work. Instead, it simply redirects them to the :show page under the new /user URL path.

This is a trade-off. We could write a complex plug that allows all requests to the old route /profile/:id/* to function as if they were made to /user/:id/*, but we can avoid that complexity if our use case is just to redirect lost users to the new route.

Ideally we want to keep these plugs short and simple, ensuring that they are easy to understand and easy to test.


Instrumenting your Phoenix application using telemetry

Instrumenting an application can be a huge undertaking, even before trying to decide what to measure and when to measure it.

Getting started with instrumentation can be intimidating because we often try to group together three separate concerns:

  1. How will we collect measurements?
  2. What should we measure?
  3. Where will we store those measurements?

In reality, building a good foundation around How will we collect measurements? is all you need to get started.

Just as it would be foolhardy to try to map out all the features of our application without understanding what our users need, trying to answer the question of What should we measure? upfront will lead us to measure the wrong things. We can table this question for now, and let our future performance smells and analytics needs inform our answer when we are ready.

And we don’t need to answer Where will we store those measurements? right away, because that’s a storage implementation detail.

A good foundation around how we collect measurements will allow us to easily add and remove measurements in the future, and switch out the storage implementation with minimal code changes.

Let’s build that foundation for a Phoenix application. Enter: telemetry.

What is telemetry?

Telemetry is a dynamic dispatching library for metrics and instrumentations. It is lightweight, small and can be used in any Erlang or Elixir project.

In practice, the :telemetry module allows us to emit events (optionally with a data payload), and register functions to be called when events are emitted.
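
As a quick sketch (the event name, measurement, and metadata below are made up for illustration), attaching a handler and emitting an event looks roughly like this:

# Register a function to be called whenever the event is emitted.
:telemetry.attach(
  "log-checkout",                       # unique handler id
  [:app, :checkout, :done],             # event name to listen for
  fn _event, measurements, metadata, _config ->
    IO.inspect({measurements, metadata}, label: "checkout")
  end,
  nil                                   # handler configuration
)

# Emit the event with a measurement (a duration in native time units)
# and a metadata payload.
start = System.monotonic_time()
# ... the work being measured ...
:telemetry.execute(
  [:app, :checkout, :done],
  %{duration: System.monotonic_time() - start},
  %{user_id: 42}
)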

The Telemetry project aims to standardize the process of introspection by allowing libraries and frameworks to expose a set of telemetry events which can then be captured for the purposes of debugging, measurement, or logging.

Some common libraries which expose telemetry events include Phoenix and Ecto, whose events we will work with below.

Our application can also emit custom events, which can be captured in the same way as events emitted by libraries that use telemetry.

Up and running with Phoenix

Since Phoenix 1.5, applications come preconfigured with telemetry. For older applications, it can be manually added.

First, add the following as dependencies in mix.exs:

{:telemetry_metrics, "~> 0.4"},
{:telemetry_poller, "~> 0.4"}

Next, create a telemetry supervisor at lib/app_web/telemetry.ex; the below is identical to what is generated by default since Phoenix 1.5:

defmodule AppWeb.Telemetry do
  use Supervisor
  import Telemetry.Metrics

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    children = [
      # Telemetry poller will execute the given period measurements
      # every 10_000ms. Learn more here: https://hexdocs.pm/telemetry_metrics
      {:telemetry_poller, measurements: periodic_measurements(), period: 10_000}
      # Add reporters as children of your supervision tree.
      # {Telemetry.Metrics.ConsoleReporter, metrics: metrics()}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  def metrics do
    [
      # Phoenix Metrics
      summary("phoenix.endpoint.stop.duration",
        unit: {:native, :millisecond}
      ),
      summary("phoenix.router_dispatch.stop.duration",
        tags: [:route],
        unit: {:native, :millisecond}
      ),

      # Database Metrics
      summary("app.repo.query.total_time", unit: {:native, :millisecond}),
      summary("app.repo.query.decode_time", unit: {:native, :millisecond}),
      summary("app.repo.query.query_time", unit: {:native, :millisecond}),
      summary("app.repo.query.queue_time", unit: {:native, :millisecond}),
      summary("app.repo.query.idle_time", unit: {:native, :millisecond}),

      # VM Metrics
      summary("vm.memory.total", unit: {:byte, :kilobyte}),
      summary("vm.total_run_queue_lengths.total"),
      summary("vm.total_run_queue_lengths.cpu"),
      summary("vm.total_run_queue_lengths.io")
    ]
  end

  defp periodic_measurements do
    [
      # A module, function and arguments to be invoked periodically.
      # This function must call :telemetry.execute/3 and a metric must be added above.
      # {AppWeb, :count_users, []}
    ]
  end
end

Be sure to replace AppWeb in the above with your application module name.

Finally, add the telemetry module to your application’s supervision tree in lib/app/application.ex:

children = [
  App.Repo,
  AppWeb.Telemetry,
  AppWeb.Endpoint,
  . . .
]

After installing dependencies and restarting the application, we’re ready to get started.

Collecting metrics

The AppWeb.Telemetry module can capture telemetry events as they are emitted and turn them into metrics, and it can also emit its own measurements on a set interval via the :telemetry_poller child it starts.
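
As a sketch of the periodic case (the [:app, :users] event name is ours, and App.Accounts.count_users/0 is a hypothetical function in your application), we could have the poller configured in init/1 call a function every ten seconds that emits a measurement, then capture it with one of the metric types described below:

# In AppWeb.Telemetry:
defp periodic_measurements do
  [
    # Invoked by :telemetry_poller every 10_000ms (see init/1 above).
    {__MODULE__, :emit_user_count, []}
  ]
end

def emit_user_count do
  # App.Accounts.count_users/0 is assumed to exist in your application.
  :telemetry.execute([:app, :users], %{count: App.Accounts.count_users()}, %{})
end

# And in metrics/0:
last_value("app.users.count")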

A metric, as defined by Telemetry.Metrics, is an aggregation of events with a specific name. Exactly how events are aggregated depends on which of the five available metric types we utilize (a short example follows the list):

  • counter/2 - counts the total number of emitted events
  • sum/2 - sums the selected measurement
  • last_value/2 - the value of the selected measurement from the most recent event
  • summary/2 - the statistical mean, maximum, and percentiles for the selected measurement
  • distribution/2 - a histogram of the selected measurement
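
For example, the same router dispatch event that the generated module summarises could also be counted per route, and the VM memory event could be reduced to its most recent reading. A small sketch, not part of the generated module:

# counter/2 ignores the measurement value and only counts event occurrences.
counter("phoenix.router_dispatch.stop.duration", tags: [:route]),

# Keep only the most recent reading of total VM memory.
last_value("vm.memory.total", unit: {:byte, :kilobyte})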

The module Phoenix generates for us captures a handful of telemetry events that are available out of the box from Phoenix and Ecto as summary metrics. Metrics are defined in the module’s metrics/0 function:

def metrics do
  [
    # Phoenix Metrics
    summary("phoenix.endpoint.stop.duration",
      unit: {:native, :millisecond}
    ),
    summary("phoenix.router_dispatch.stop.duration",
      tags: [:route],
      unit: {:native, :millisecond}
    ),

    # Database Metrics
    summary("app.repo.query.total_time", unit: {:native, :millisecond}),
    summary("app.repo.query.decode_time", unit: {:native, :millisecond}),
    summary("app.repo.query.query_time", unit: {:native, :millisecond}),
    summary("app.repo.query.queue_time", unit: {:native, :millisecond}),
    summary("app.repo.query.idle_time", unit: {:native, :millisecond}),

    # VM Metrics
    summary("vm.memory.total", unit: {:byte, :kilobyte}),
    summary("vm.total_run_queue_lengths.total"),
    summary("vm.total_run_queue_lengths.cpu"),
    summary("vm.total_run_queue_lengths.io")
  ]
end

We can look at the events emitted by Phoenix itself and by the other libraries we use, and capture them as different metrics by adding them to the metrics/0 function.

The definitions of these metrics alone don’t do much though — we’ll need to send them somewhere.

Reporting metrics

Metrics defined in AppWeb.Telemetry.metrics/0 need to be reported in order for us to make use of them, and so Telemetry.Metrics has the concept of a reporter which does exactly that.

Out of the box, Telemetry.Metrics.ConsoleReporter is provided, which, when enabled, will output the metrics we define to the Phoenix log.

Reporters can be added by specifying them as children to our supervisor in AppWeb.Telemetry.init/1. Let’s enable the console reporter by uncommenting it:

def init(_arg) do
  children = [
    # Telemetry poller will execute the given period measurements
    # every 10_000ms. Learn more here: https://hexdocs.pm/telemetry_metrics
    {:telemetry_poller, measurements: periodic_measurements(), period: 10_000},
    # Add reporters as children of your supervision tree.
    {Telemetry.Metrics.ConsoleReporter, metrics: metrics()}
  ]

  Supervisor.init(children, strategy: :one_for_one)
end

After restarting your application, you should start seeing messages similar to the following in your application logs:

[Telemetry.Metrics.ConsoleReporter] Got new event!
Event name: phoenix.router_dispatch.stop
All measurements: %{duration: 2109000}
All metadata: %{conn: %Plug.Conn{. . .}, . . .} 

[Telemetry.Metrics.ConsoleReporter] Got new event!
Event name: vm.memory
All measurements: %{atom: 729321, atom_used: 702036, binary: 2443936, code: 14887818, ets: 1701368, processes: 8723368, processes_used: 8723368, system: 40073656, total: 48797024}
All metadata: %{}

There are many telemetry_metrics_* packages available on Hex which facilitate reporting metrics to common backends such as StatsD, AppSignal, or Telegraf.
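
For example, shipping the same metrics to StatsD could be as simple as adding the telemetry_metrics_statsd reporter alongside (or instead of) the console reporter. A hedged sketch: the version shown is indicative, and the reporter sends to a StatsD agent on localhost:8125 by default:

# In mix.exs:
{:telemetry_metrics_statsd, "~> 0.5"},

# In AppWeb.Telemetry.init/1:
children = [
  {:telemetry_poller, measurements: periodic_measurements(), period: 10_000},
  {TelemetryMetricsStatsd, metrics: metrics()}
]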

Writing your own reporter is also an option, using the ConsoleReporter as a reference implementation.
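
Under the hood, a reporter is simply a process that attaches a telemetry handler for each event its metrics reference, and then aggregates or forwards the measurements as events arrive. Below is a stripped-down sketch of our own; the module name and log-only behaviour are assumptions, loosely modelled on the console reporter:

defmodule AppWeb.LoggerReporter do
  use GenServer
  require Logger

  def start_link(opts) do
    GenServer.start_link(__MODULE__, Keyword.fetch!(opts, :metrics))
  end

  @impl true
  def init(metrics) do
    Process.flag(:trap_exit, true)

    # Attach one handler per event, passing the metrics interested in that
    # event as the handler config.
    groups = Enum.group_by(metrics, & &1.event_name)

    for {event, metrics} <- groups do
      :telemetry.attach({__MODULE__, event, self()}, event, &__MODULE__.handle_event/4, metrics)
    end

    {:ok, Map.keys(groups)}
  end

  # Invoked by :telemetry whenever one of the attached events fires. For
  # simplicity we only handle metrics whose measurement is a plain key of
  # the measurements map, and we just log the value.
  def handle_event(_event_name, measurements, _metadata, metrics) do
    for %{measurement: key} = metric <- metrics, is_map_key(measurements, key) do
      Logger.info("#{Enum.join(metric.name, ".")}: #{inspect(measurements[key])}")
    end

    :ok
  end

  @impl true
  def terminate(_reason, events) do
    for event <- events, do: :telemetry.detach({__MODULE__, event, self()})
    :ok
  end
end

It could then be enabled just like the console reporter, as a child in init/1: {AppWeb.LoggerReporter, metrics: metrics()}.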

Emitting custom telemetry events

So far we’ve looked at using events emitted by third party libraries, but much in the same way, we can emit our own telemetry events and turn them into metrics.

For example, if we wanted to emit a telemetry event when a particular controller action was executed, we could emit the event directly from the controller using the :telemetry module:

defmodule AppWeb.SplashPageController do
  use AppWeb, :controller

  def index(conn, _params) do
    :telemetry.execute(
      [:web, :controller, :action],
      %{},
      %{
        controller: :splash_page,
        action: :index
      }
    )

    render(conn, "index.html")
  end
end

The :telemetry.execute/3 function takes an event name (a list of atoms), a map of measurements, and a map of event metadata.

There is nothing special about the event name or the name of the keys in the measurements or metadata maps. They are the same names that consumers of the event will use to listen for and access information about the event, but they can be anything that makes sense for your custom event.

In the above example, we’re not adding any measurements, so consumers of this event would only be able to count the number of times it occurred.

We are adding some event metadata though — a :controller and :action key — to allow consumers of the [:web, :controller, :action] event to know which controller and action this event was emitted from.
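
Back in AppWeb.Telemetry.metrics/0, we could then capture this custom event, for example as a counter broken down by those metadata keys. A sketch; the .count suffix is just a conventional measurement name, since counter/2 ignores the measurement value anyway:

counter("web.controller.action.count", tags: [:controller, :action])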

We could have instead emitted an event that was very specific to this controller action, passing no measurements or metadata:

:telemetry.execute([:web, :controller, :splash_page, :index], %{}, %{})

This would have the same effect, but it is less flexible because it doesn’t allow consumers to easily capture all controller actions or all events from the :splash_page controller. Metadata gives us a flexible way to keep our event names generic and multipurpose, while filtering down the events to the context we care about.

As for passing measurements to :telemetry.execute/3 - what values you pass depends on what (if anything) you want to measure.

However, you should not use :telemetry.execute/3 to measure code execution time.

For that purpose, :telemetry exposes span/3:

defmodule AppWeb.PageController do
  use AppWeb, :controller

  def index(conn, _params) do
    :telemetry.span(
      [:web, :controller, :action],
      %{},
      fn ->
        result = render(conn, "index.html")
        {result, %{}}
      end
    )
  end
end

The span/3 function in the above example will emit two events instead of just one:

  • In the case the code block succeeded: [:web, :controller, :action, :start] and [:web, :controller, :action, :stop]
  • In the case the code block failed: [:web, :controller, :action, :start] and [:web, :controller, :action, :exception]

Additionally, these events include some default measurements:

  • The start event includes :system_time
  • The stop event includes :duration
  • The exception event also includes :duration, with the error details (:kind, :reason, and :stacktrace) added to the event metadata
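
These span events can be captured in metrics/0 like any others; for example, a sketch that mirrors how the generated module treats Phoenix’s own stop events:

# Latency of the wrapped block, taken from the stop event's :duration.
summary("web.controller.action.stop.duration", unit: {:native, :millisecond}),

# Count how many times the wrapped block raised.
counter("web.controller.action.exception.duration")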

Encapsulating custom metrics

In the previous examples, we inlined the calls to the :telemetry module in our controllers, but it is often more readable and reusable to define our events in a separate module.

For that purpose, let’s define an AppWeb.TelemetryEvents module at lib/app_web/telemetry_events.ex:

defmodule AppWeb.TelemetryEvents do
  def controller_action(controller, action) do
    :telemetry.execute(
      [:web, :controller, :action],
      %{},
      %{
        controller: controller,
        action: action
      }
    )
  end
end

Then, when we want to emit that event from our controller, we can write:

defmodule AppWeb.PageController do
  use AppWeb, :controller

  def index(conn, _params) do
    AppWeb.TelemetryEvents.controller_action(:splash_page, :index)

    render(conn, "index.html")
  end
end

Defining events as functions in a dedicated module keeps our event names consistent throughout our application, and means that changes to an event only need to be made in one place rather than everywhere the event is emitted.

Final thoughts

Having built a good foundation for collecting application metrics, it is straightforward to add additional metrics as we need them.

When it comes to the next step — What should we measure? — we should avoid rushing to answer that question.

We can add metrics when we notice performance smells in our application, or when a requirement comes along to capture additional data for analytics purposes.

It’s important to only measure what we need. Measuring too much will have an impact on application performance, and result in more data we need to sort through when trying to diagnose a problem.

The question of Where will we store those measurements? is now an implementation detail. Our foundation uses swappable metrics reporters, and we can have one or multiple of them active at one time!

As a bonus: be sure to check out Phoenix LiveDashboard, which can consume these same metrics and visualize them in a LiveView dashboard hosted by your Phoenix app.

Permalink

What's new in MongooseIM 4.0 - The friendly Mongoose

Hello from the team at MongooseIM

It’s been a busy four months. As most of us were locked in our homes, we decided to put the time to use and prepare a really special release. We introduced a new configuration format, structured logging and many new features and extensions that add up to a product we are proud to share with you. MongooseIM has always empowered users to create a customised, owned chat application without the struggle of building one from scratch. Now we’ve made those features even more accessible and easier to use.

Friendly to developers with TOML configuration

We want everyone to be able to benefit from MongooseIM, and so it was a rude awakening to hear the configuration described as ‘the trenches of the Somme’ by one of our users. Given we love Erlang, we hadn’t considered that its configuration might be a barrier for some developers. Once we read the feedback, we knew that had to change. In the 4.0 release we are introducing a new configuration format, and for that task we’ve decided to go with TOML. Thanks to its merits we managed to get rid of most of the nested lists and tuples that were sometimes followed by a comma and at other times by a dot. We have simplified the syntax and disassembled the trenches while keeping the required configuration capabilities.
If you want to have a closer look at the new configuration format, please have a look at the Configuration section in our docs.

Friendly to Kubernetes with Helm Charts

We all like an installation procedure that is as simple as installing a package, so we’ve made it possible to install MongooseIM and MongoosePush on Kubernetes via Helm charts.
You can find our organisation’s Helm charts at the link below:
https://artifacthub.io/packages/search?page=1&org=mongoose

Friendly to DevOps with structured logging

In MongooseIM 4.0 we’re introducing structured logs. They give logged events a more precise and consistent structure to query against. This is a tool you may not need often, but when you do, it makes it significantly easier to find exactly what you’re looking for.
If you are not yet familiar with the new OTP logger and structured logs, we recommend this blog post: https://ferd.ca/erlang-otp-21-s-new-logger.html

Friendly for users with video and voice calling

With the new release, we added the implementation for XEP-0215: External Service Discovery which assists in discovering information about services external to the XMPP network. The main use-case is to help discover STUN/TURN servers to allow for negotiating media exchanges.
So if you want to have a video or voice call using MongooseIM and Conversations, now you can. You can use MongooseICE as a STUN/TURN relay, configure MongooseIM with mod_extdisco enabled, and start having video calls between connected users.
For more details on how to use and set up mod_extdisco and our STUN/TURN server, stay tuned for our future blog posts. In the meantime, please see our documentation page: https://mongooseim.readthedocs.io/en/latest/modules/mod_extdisco/

Friendly for everyone with improvements to MongoosePush

We’ve released a new MongoosePush. In the 2.1 release you will find:

  • OpenAPI specs
  • Phoenix as the web framework
  • Structured logs, with logfmt and JSON formatters
  • Metrics: Exometer to Telemetry, multidimensional metrics
  • Many improvements in the testing pipeline

For more information on the new MongoosePush, please have a look at the release notes https://github.com/esl/MongoosePush/releases/tag/2.1.0

Friendly for managers with AMOC 2.1 Load testing

We released AMOC 2.1. This release focuses on the REST API, which is now powered by OpenAPI Specifications generated by openapi-generator. We’ve also significantly reworked the REST API so you can upload files with simple PUT requests. With the newly introduced documentation API for scenarios, you can now check what a scenario is about before running it. Finally, the execution API was updated, so you now have full control over operations such as starting and stopping scenarios and adding or removing users. This makes load testing even easier, so you can demonstrate the value of MongooseIM to your management team.

Let’s be friends!

So if you ever considered MongooseIM for your product or a project but you didn’t choose it for some reason, it’s time to give it a try. It’s the most robust, scalable and now easiest to configure Instant Messaging solution available on the market. Learn more about how MongooseIM stacks up against the competitors in terms of key considerations like costs and features in our complete guide to choosing a messaging app. Or, explore the MongooseIM page.

One last word from your friends at MongooseIM


After working hard to get the new release live, we wanted to show off a little creative spirit. Here’s the MongooseIM team’s summary of MongooseIM 4.0, as inspired by the theme song to Friends! https://www.youtube.com/watch?v=sLisEEwYZvw

So no one told you MongooseIM 4.0 was gonna be this way
When your app won’t scale, you’re broke
Your XMPP life’s DOA
It’s like you’re always stuck with a single node
When it hasn’t been your day, your week, your month
Or even your year, but
MongooseIM will be there for you
(When the rain of messages starts to pour)
MongooseIM will be there for you
(When you like to configure with TOML)
MongooseIM will be there for you
(‘Cause structured logs are for people too)

You may also like:

Our complete guide to Instant Messaging

MongooseIM - page

How to add messaging - webinar

Testing the scalability of MongooseIM - blog

How we can add value to your product with chat - blog

Permalink

Performance testing the JIT compiler for the BEAM VM

Erlang JIT

In early September, our colleague, Lukas Larsson, announced the pull request for BEAMAsm, a just-in-time compiler for the Erlang Virtual Machine. Lukas worked on the development of this feature in conjunction with the OTP core team at Ericsson. The JIT compiler is a huge addition to the BEAM community. Two of its most impressive features are the speed improvements it delivers, offering anything from a 30% to 130% increase in the number of iterations per second, and its perf integration, which allows developers to profile where bottlenecks are in production.

In this blog, we’ll take you through how we conducted the benchmarking and performance testing of the JIT using a RabbitMQ deployment. If you’d like to see the talk from Code BEAM V, watch the video below.

RabbitMQ benchmarks

Like any performance comparison, the first thing we need to do is establish a benchmark for RabbitMQ. To do this, we create a Docker image that contains both the Erlang interpreter and the JIT, so we can switch between the two at runtime. Below is what such a Dockerfile could look like.

FROM docker.pkg.github.com/erlang/otp/ubuntu-base

RUN apt-get update && apt-get install -y rabbitmq-server git linux-tools-generic

## Enable us to connect from outside the docker container without auth
RUN echo "loopback_users = none" >> /etc/rabbitmq/rabbitmq.conf

ENV MAKEFLAGS=-j4 \
        ERLC_USE_SERVER=yes \
        ERL_TOP=/buildroot/otp \
        PATH=/otp/bin:$PATH

WORKDIR /buildroot

## Download Erlang/OTP with JIT support
RUN git clone https://github.com/erlang/otp && cd otp && \
        git checkout 4d9f947ea71b05186d25ee346952df47d8339da6

WORKDIR /buildroot/otp/

## Build Erlang/OTP with JIT support
RUN ./otp_build setup -a --prefix=/otp/ && make install

## Build Erlang/OTP without JIT support
RUN make install FLAVOR=emu

USER rabbitmq

CMD "rabbitmq-server"

Then we build it as follows:

docker build -t rabbit .

Then we can start RabbitMQ with the JIT:

docker run -d -e ERL_FLAGS="" -p 5672:5672 -p 15672:15672

We can also start it with the interpreter:

docker run -d -e ERL_FLAGS="-emu_flavor emu" -p 5672:5672 -p 15672:15672 rabbit

RabbitMQ PerfTest

We then use RabbitMQ PerfTest to measure the difference in the number of messages per second:

Single Queue Performance

> docker run -d -e ERL_FLAGS="-emu_flavor emu" -p 5672:5672 -p 15672:15672 rabbit
476fba6ad56c5d8b34ceac9336b035737c021dee788f3f6c0d21b9309d67373e
> bin/runjava com.rabbitmq.perf.PerfTest --producers 1 --consumers 1 --queue q1 --flag mandatory --qos 300 --confirm 100
id: test-104559-735, starting consumer #0
id: test-104559-735, starting consumer #0, channel #0
id: test-104559-735, starting producer #0
id: test-104559-735, starting producer #0, channel #0
id: test-104559-735, time: 1,000s, sent: 13548 msg/s, returned: 0 msg/s, confirmed: 13449 msg/s, nacked: 0 msg/s, received: 13433 msg/s, min/median/75th/95th/99th consumer latency: 404/5648/7770/11609/14864 µs, confirm latency: 241/3707/5544/9435/11787 µs
id: test-104559-735, time: 2,000s, sent: 20558 msg/s, returned: 0 msg/s, confirmed: 20558 msg/s, nacked: 0 msg/s, received: 20569 msg/s, min/median/75th/95th/99th consumer latency: 919/4629/6245/9456/16564 µs, confirm latency: 440/3881/4493/6971/9642 µs
id: test-104559-735, time: 3,000s, sent: 26523 msg/s, returned: 0 msg/s, confirmed: 26530 msg/s, nacked: 0 msg/s, received: 26526 msg/s, min/median/75th/95th/99th consumer latency: 1274/3689/3970/4524/5830 µs, confirm latency: 495/3617/3842/4260/4754 µs
id: test-104559-735, time: 4,000s, sent: 25835 msg/s, returned: 0 msg/s, confirmed: 25827 msg/s, nacked: 0 msg/s, received: 25834 msg/s, min/median/75th/95th/99th consumer latency: 1946/3830/4124/4451/5119 µs, confirm latency: 1852/3760/4051/4415/5275 µs
id: test-104559-735, time: 5,000s, sent: 25658 msg/s, returned: 0 msg/s, confirmed: 25658 msg/s, nacked: 0 msg/s, received: 25659 msg/s, min/median/75th/95th/99th consumer latency: 556/3840/4110/4646/7325 µs, confirm latency: 1611/3727/4020/4474/5996 µs
....
id: test-104559-735, time: 15,000s, sent: 25180 msg/s, returned: 0 msg/s, confirmed: 25177 msg/s, nacked: 0 msg/s, received: 25182 msg/s, min/median/75th/95th/99th consumer latency: 843/3933/4152/4543/5517 µs, confirm latency: 863/3898/4110/4495/9506 µs
^Ctest stopped (Producer thread interrupted)
id: test-104559-735, sending rate avg: 24354 msg/s
id: test-104559-735, receiving rate avg: 24352 msg/s
> docker stop 476fba6ad56c5d8b34ceac9336b035737c021dee788f3f6c0d21b9309d67373e

The above results are for single queue performance with the interpreter under heavy load.

We see that the average sending and receiving rates are both around 24k msg/s. If we run the same test with the JIT compiler, we get the following results:

> docker run -d -e ERL_FLAGS="" -p 5672:5672 -p 15672:15672 rabbit
993b90ab29f662b45ad4cab7d750a367ac9e5d47381812cad138e4ec2f64b2a3
> bin/runjava com.rabbitmq.perf.PerfTest -x 1 -y 1 -u q1 -f mandatory -q 300 -c 100
...
id: test-105204-749, time: 15,000s, sent: 39112 msg/s, returned: 0 msg/s, confirmed: 39114 msg/s, nacked: 0 msg/s, received: 39114 msg/s, min/median/75th/95th/99th consumer latency: 1069/2578/2759/3101/4240 µs, confirm latency: 544/2453/2682/2948/3555 µs
^Ctest stopped (Producer thread interrupted)
id: test-105204-749, sending rate avg: 36620 msg/s
id: test-105204-749, receiving rate avg: 36612 msg/s
> docker stop 993b90ab29f662b45ad4cab7d750a367ac9e5d47381812cad138e4ec2f64b2a3

That is already around 36k msg/s, a 45% increase in the number of messages per second.

Multi-Queue Performance

With a 45% increase in single queue performance, it’s time to test how the JIT compares on multi-queue performance. Below are the results for the Erlang interpreter.

bin/runjava com.rabbitmq.perf.PerfTest -x 150 -y 300 -f mandatory -q 300 -c 100 --queue-pattern 'perf-test-%d' --queue-pattern-from 1 --queue-pattern-to 100
...
id: test-105930-048, time: 60,396s, sent: 39305 msg/s, returned: 0 msg/s, confirmed: 39293 msg/s, nacked: 0 msg/s, received: 39241 msg/s, min/median/75th/95th/99th consumer latency: 44909/352133/503781/742036/813501 µs, confirm latency: 36217/352381/491656/714740/821029 µs
id: test-105930-048, sending rate avg: 37489 msg/s
id: test-105930-048, receiving rate avg: 37242 msg/s

Here you can see 37,489 messages sent per second and 37,242 received.

Now let’s test that benchmark against the JIT:

bin/runjava com.rabbitmq.perf.PerfTest -x 150 -y 300 -f mandatory -q 300 -c 100 --queue-pattern 'perf-test-%d' --queue-pattern-from 1 --queue-pattern-to 100
...
id: test-105554-348, time: 60,344s, sent: 50905 msg/s, returned: 0 msg/s, confirmed: 50808 msg/s, nacked: 0 msg/s, received: 50618 msg/s, min/median/75th/95th/99th consumer latency: 32479/250098/359481/565181/701871 µs, confirm latency: 16959/257612/371392/581236/720861 µs
^Ctest stopped (Producer thread interrupted)
id: test-105554-348, sending rate avg: 47626 msg/s
id: test-105554-348, receiving rate avg: 47390 msg/s

Again, we have a significant boost in performance; this time around 30% more messages are sent and received per second.

Profiling

We can use perf to profile what is happening under the hood in RabbitMQ. First, we need to add some permissions to run the perf tool. There are many ways to do that; the simplest (though not the most secure) is to add --cap-add SYS_ADMIN when starting the container. I’ve also added +S 1, because that makes the perf output a little easier to reason about, and +JPperf true to be able to disassemble the JIT:ed code.

> docker run -d --cap-add SYS_ADMIN -e ERL_FLAGS="+S 1 +JPperf true" -p 5672:5672 -p 15672:15672 rabbit
514ea5d380837328717

Then we can run perf in an exec session to the container:

> docker exec -it 514ea5d380837328717 bash
> perf record -k mono --call-graph lbr -o /tmp/perf.data -p $(pgrep beam) -- sleep 10
[ perf record: Woken up 57 times to write data ]
[ perf record: Captured and wrote 15.023 MB /tmp/perf.data (54985 samples) ]
> perf inject --jit -i /tmp/perf.data -o /tmp/perf.data.jit
> perf report -i /tmp/perf.data.jit

If you run the above while the RabbitMQ PerfTest is running, it will display as seen in the image below:

perf report output (accumulated time)

The default report from perf sorts the output according to the accumulated run-time of a function and its children. So from the above, we can see that most of the time seems to be spent doing bif calls, closely followed by gen_server2:loop and gen_server2:handle_msg.

It makes sense for gen_server2 to be at the top of accumulated time, as that is where most of RabbitMQ’s code is called from, but why would the accumulated time for bif calls be so extensive? The answer is in the usage of lbr (last branch records) as the call stack sampling method. Since the lbr buffer is limited, it does not always contain the full call stack of a process, so when calling bifs that have a lot of branches, perf loses track of the calling process and thus cannot attribute the time to the correct parent frame. Using lbr has a lot of drawbacks and you have to be careful when analyzing its findings; however, it is better than having no call stack information at all.

Using the information from the lbr, we can expand the call_light_bif function and see which bifs we are calling and how long each call takes:

perf report output with call_light_bif expanded

Here we can see that a lot of accumulated time is spent doing erlang:port_command, ets:select and erlang:port_control.

So the RabbitMQ application is busy doing tcp communication and selecting on an ets table. If we wanted to optimize RabbitMQ for this benchmark, it would be a good start to see if we could eliminate some of those calls.

Another approach to viewing the perf report is by sorting by time spent in each function without its children. You can do that by calling perf as follows:

> perf report --no-children -i /tmp/perf.data

perf report output sorted without children

The functions that we spent the most time on here are do_minor (part of the emulator that does garbage collection), make_internal_hash (the hashing function for terms used internally, which does not have to be backwards compatible) and eq (the function used to compare terms). If we expand make_internal_hash and eq we see something interesting:

perf report output with make_internal_hash and eq expanded

The top user of both functions is erlang:put/2, i.e. storing a value in the process dictionary. So it would seem that RabbitMQ is doing a lot of process dictionary updates. After some digging, I found that these calls come from rabbit_reader:process_frame/3 and credit_flow:send/2, which seem to deal with flow control within RabbitMQ. A possible optimization here is to use the fact that literal keys to erlang:put/2 are pre-hashed when the module is loaded. That is, instead of using a dynamic key like this:

erlang:put({channel, Channel}, Val)

you use a literal like this:

erlang:put(channel, Val)

and the emulator will not have to calculate the hash every time we put a value in the process dictionary. This is most likely not possible in this specific case for RabbitMQ, but it’s worth keeping in mind for further investigations.

We’re excited about the JIT compiler and its use cases. If you’ve put it to use and can share some insight, get in touch with us, as we would be happy to co-organise a webinar on the JIT compiler! If you’d like more help understanding how the JIT compiler can improve the performance of your RabbitMQ platform, Phoenix application, or any other Erlang/Elixir system, get in touch; we’re always happy to help!

You may also like:

Code Mesh V

Our RabbitMQ services

Our Erlang & Elixir Consultancy

Online Erlang and Elixir training

Permalink

Real time communication at scale with Elixir at Discord

Welcome to our series of case studies about companies using Elixir in production. Click here to see all cases we have published so far.

Founded in 2015 by Jason Citron and Stan Vishnevskiy, Discord is a permanent, invite-only space for your communities and friends, where people can hop between voice, video, and text, depending on how they want to talk, letting them have conversations in a very natural or authentic way. Today, the service has over 100 million monthly active users from across the globe. Every day people spend 4 billion minutes in conversation on Discord servers, across 6.7 million active servers / communities.

From day one, Discord has used Elixir as the backbone of its chat infrastructure. When Discord first adopted the language, they were still working on building a viable business, with many questions and challenges in front of them. Elixir played a crucial role in giving them the desired technological flexibility to grow the company and also became the building block that would allow their systems to run on a massive scale.


Starting technologies

Back in 2015, Discord chose two main languages to build their infrastructure: Elixir and Python. Elixir was initially picked to power the WebSocket gateway, responsible for relaying messages and real-time replication, while Python powered their API.

Nowadays, the Python API is a monolith while the Elixir stack contains 20 or so different services. These architectural choices do not represent a dichotomy between the languages but rather a pragmatic decision. Mark Smith, from the Discord team, explains it succinctly: “given the Elixir services would handle much bigger traffic, we designed them in a way where we could scale each service individually.”

Discord has also explored other technologies along the way, Go and Rust being two examples, with distinct outcomes. While Discord completely phased out Go after a short foray, Rust has proven to be an excellent addition to their toolbox, boosted by its ability to play well with Elixir and Python.

Communication at scale

Effective communication plays an essential role when handling millions of connected users concurrently. To put things into perspective, some of Discord’s most popular servers, such as those dedicated to Fortnite and Minecraft, are nearing six hundred thousand users. At a given moment, it is not unlikely to encounter more than two hundred thousand active users in those servers. If someone changes their username, Discord has to broadcast this change to all connected users.

Overall, Discord’s communication runs at impressive numbers. They have crossed more than 12 million concurrent users across all servers, with more than 26 million WebSocket events to clients per second, and Elixir is powering all of this.

In terms of real time communication, the Erlang VM is the best tool for the job.

— Jake Heinz, Lead Software Engineer

When we asked their team “Why Elixir?”, Jake Heinz gave a straight-forward answer: “In terms of real time communication, the Erlang VM is the best tool for the job. It is a very versatile runtime with excellent tooling and reasoning for building distributed systems”. Technologically speaking, the language was a natural fit. However, Elixir was still a bet back in 2015: “Elixir v1.0 had just come out, so we were unsure in which direction the language would go. Luckily for us, we have been pleased with how the language has evolved and how the community shaped up.”

The chat infrastructure team

To power their chat messaging systems, Discord runs a cluster with 400-500 Elixir machines. Perhaps, the most impressive feat is that Discord’s chat infrastructure team comprises five engineers. That’s right: five engineers are responsible for 20+ Elixir services capable of handling millions of concurrent users and pushing dozens of millions of messages per second.

Discord also uses Elixir as the control plane of their audio and video services, also known as signaling, which establishes communication between users. C++ is then responsible for media streaming, a combination that altogether runs on 1000+ nodes.

The Elixir services communicate between them using Distributed Erlang, the communication protocol that ships as part of the Erlang Virtual Machine. By default, Distributed Erlang builds a fully meshed network, but you can also ask the Erlang VM to leave the job of outlining the topology up to you, by setting the aptly named -connect_all false flag. The Discord team sets this option to assemble a partially meshed network with etcd being responsible for service discovery and hosting shared configuration.

The chat infrastructure developers are not the only ones touching the Elixir codebases. According to Mark Smith, this is an important part of Discord’s culture: “We don’t work in silos. So a Python developer may have to work on the Elixir services when building a new feature. We will spec out the feature together, figure out the scalability requirements, and then they will work on a pull request, which we will review and help them iterate on it.”

Community and challenges

To run at this scale, Discord learned how to leverage the Erlang VM’s power, its community, and when to recognize challenges that require them to reach for their own solutions.

For example, Discord uses Cowboy for handling WebSocket connections and TCP servers. To manage data bursts and provide load regulation, such as back-pressure and load-shedding, they use GenStage, which they have discussed in detail in the past.

Other times, the efforts of the company and the community go hand in hand. That was the case when Discord used the Rustler project, which provides a safe bridge between Elixir and Rust, to scale to 11 million concurrent users. They used Rustler to hook a custom data structure built in Rust directly into their Elixir services.

However, the team has made abundantly clear that the powerhouse is the Erlang platform. Every time they had to push their stack forward, they never felt cornered by the technology. Quite the opposite, their engineers could always build efficient solutions that run at Discord’s scale, often in a few hundred lines of code. Discord frequently gives these projects back to the community, as seen in Manifold and ZenMonitor.

The Discord team also adapted quickly when things went wrong. For instance, they attempted twice to use Mnesia in production, a database that ships as part of Erlang’s standard library. They tried Mnesia in persistent and in-memory modes, and the database nodes would often fall behind in failure scenarios, sometimes being unable to ever catch up. Eventually they ditched Mnesia altogether and built the desired functionality with Erlang’s built-in constructs, such as GenServer and ETS. Nowadays, they resolve these same failure scenarios within 2-3 seconds.

Mastering Elixir

None of the chat infrastructure engineers had experience with Elixir before joining the company. They all learned it on the job. Team members Matt Nowack and Daisy Zhou report initially struggling to understand how all of their services communicate. Matt adds: “In the beginning, it was hard to accept all of the guarantees that Erlang VM provides. I’d worry about data races and concurrency issues that were impossible to happen”. Eventually, they took these guarantees to heart and found themselves more productive and more capable of relying on the platform and its tools. Matt continues: “The introspection tools the Erlang VM provides is the best in class. We can look at any VM process in the cluster and see its message queue. We can use the remote shell to connect to any node and debug a live system. All of this has helped us countless times.”

Running at Discord’s scale adds its own dimension to mastering the language, as engineers need to familiarize themselves with the abstractions for providing concurrency, distribution, and fault-tolerance. Nowadays, frameworks such as Nerves and Phoenix handle these concerns for developers, but the underlying building blocks are always available for engineers assembling their own stack, such as the Discord team.

In the end, Jake summarized how crucial Elixir and the Erlang VM have been at Discord and how it affected him personally: “What we do in Discord would not be possible without Elixir. It wouldn’t be possible in Node or Python. We would not be able to build this with five engineers if it was a C++ codebase. Learning Elixir fundamentally changed the way I think and reason about software. It gave me new insights and new ways of tackling problems.”

Permalink

Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.