Erlang Installer: A Better Way To Use Erlang On OSX

Back in 2013, I joined Erlang Solutions to work on providing something more convenient for Erlang developers than self-compiled bundles of OTP apps. I'm happy to say that in that time, we've managed to significantly improve the ease of deploying Erlang on a variety of systems, especially variants of Linux.

More recently, we've decided to make obtaining Erlang/OTP as painless as possible on OS X. To that end, we've recruited the wonderful people at Inaka Networks to help us create a better, slicker version of the Erlang Solutions OS X Installer.

Motivation for changes

The previous iteration of the Installer supports auto-updating when a new version comes out.

This can be a very useful feature for the people who like to stay on the cutting edge, but a lot of serious developers like to stick to an older version they know is supported by the software they are using. Moreover, looking around our own company, we have noticed people forgo downloading our own Installer because it does not offer an indispensable feature: the ability to quickly switch between different versions. To many devs, it is important to be able to switch to an old Erlang version in order to test a patch on a legacy system, before quickly popping back to 19.0. Some people keep 4-5 different versions of Erlang on their machine! This is why we've decided to change the ESL Installer to allow this kind of feature, while incorporating changes that will make the app feel more modern.

Installation

Installation of the new ESL Installer is as easy as downloading it and dragging it into your Applications folder.

The app will show up in the tray, and its preferences can be accessed through the OS X System Preferences.

Usage

When clicked, the tray icon expands a menu that allows you to download releases, start a shell with one of the releases you already downloaded, force a check for new releases or updates, and to quickly access the preferences.

Downloading releases

The Releases tab of the Installer allows you to install or uninstall various versions of Erlang. At launch, we will be providing the newest version of each numbered release since R15; if the list is missing a version you would like to see, let us know and we'll try to provide it for you!

Settings

The General tab of the Installer allows you to set up your preferences, including the default terminal and release you'd like to use, automatic checks for updates, and automatically starting the Installer on system boot, all to ensure you stay up to date with new things in the world of Erlang.

Other perks

The new Installer app should now look more like the other OS X apps you know and love, and instead of an ugly, annoying popup, you now get far less obtrusive notifications.

Download link

To try out the new Installer, go to the direct link on our webpage.

In conclusion

We're hoping that the release of this new, more comprehensive ESL installer will convince some of the developers who previously found it lacking features to try it out again and see if it improves their Erlang coding experience. If you have any suggestions on features you'd like to see or improvements you think we can make, give us a note at packages@erlang-solutions.com.

Permalink

Operational nightmare fun: dealing with misconfigured Riak Clusters

Everyone has a strong opinion about DevOps. Either you love it, or you hate it. Personally, we believe in DevOps. In the Erlang world, we were doing DevOps even before the term was invented. The power of the Erlang Virtual Machine and the visibility you have when troubleshooting is second to none, but you need to know what you are doing, as you can affect the system and even cause outages. Speaking with developers who dislike DevOps, we are often told that the fun ends with delivering the product. They believe support is boring, and is only about being able to master the stack’s configuration possibilities. We believe they are so wrong! Give us a few minutes to tell you a story, and then you can make up your own mind.

This is a battle story from the trenches. It is a battle story about supporting and operating a production system where outages result in extensive financial losses.

Background

We were providing 24/7 support to a customer running a Riak[1] cluster with 6 nodes. Riak is an open source NoSQL database from Basho Technologies. The operations team were masters at general pre-emptive support: they monitored CPU, IO, memory and network utilisation, and checked and analysed the graphs regularly. However, they weren't Erlang or Riak experts, so we often helped them recover from the failure of individual nodes and understand and adjust their cluster configuration. What they lacked was an Erlang-centric view of the system.

The nodes were using a relatively old version of Riak, but the DevOps team was happy with the performance. Their policy was to upgrade only when they came across issues which were fixed in later versions. We came across one of these issues, which required an upgrade to the latest stable Riak release, at the time 2.0.6. As is customary with big hops between versions, upgrades that require no downtime are done in steps, using intermediate releases where the upgrade path between them has been tested thoroughly. This mitigates risk and reduces the need to customize upgrade scripts.

We started preparing for this particular upgrade, planning to first upgrade to version 1.4.12, followed by an upgrade to version 2.0.6. We provided a proposal and discussed the upgrade process in detail. Our customer upgraded their test cluster to gain the necessary experience. The trial run went smoothly and resolved all of the issues they were experiencing, giving everyone, us included, a false sense of optimism. Our customer couldn't wait to see the upgraded production cluster in action.

The first upgrade to version 1.4.12 went as planned, although much slower than expected. Immediately after the upgrade, we ran our routine health checks together with the operations team. The results were positive and the cluster was able to serve all the requests. Everyone was now looking forward to upgrading to version 2.0.6. Unfortunately, it did not happen as quickly as we hoped.

Early warning signs

Historically, the cluster had always been close to hitting its performance limits, but it always seemed to cope. The most serious problems were the large IO latency and, at times, the unreliable network. The physical machines were equipped with slow disks, and the network cards occasionally dropped packets. Also, the partitions handled by each Riak node were too many and too large. The cluster would have performed much better had one or two new Riak nodes been added and the slow disks been replaced with new SSDs. Nonetheless, the cluster was able to serve requests, but had no spare capacity to tolerate extra load.

A week after the first upgrade (to version 1.4.12), the operations team noticed that the trends in the IO graphs had changed. Disk usage of the Riak nodes had increased dramatically (by around 600 GB per node per week) and wasn't in line with the application's usage pattern. Also, the memory usage of the Riak processes was increasing. Operations decided to reboot the cluster in a rolling manner, as this would trigger Riak to free up used disk space it no longer needed.

Things getting worse

Monday evening, whilst reading a great book (which will be covered in another blog post, another day), a PagerDuty ALERT broke the silence.

What happened?!?

The operations team had started to reboot the Riak nodes in a rolling manner to free up the disk space whilst applying the configuration changes we had recommended. This worked until the third host was restarted. More and more incoming requests were timing out. This got the operations team worried, prompting them to escalate the issue to us. Large spikes appeared in the IO, memory and in the network graphs. The cluster was struggling to handle its load. The logs and graphs we inspected confirmed that the nodes were overloaded.

To understand the symptom and the cause, we need to step back and understand partition transfers [2]. In Riak data is organised into partitions. Each Riak node is responsible for a set of partitions. When a Riak node is down (e.g. because it is being restarted), its partitions are automatically taken over by the other nodes. For each partition, a secondary partition is created on another node. Existing data is not transferred, but whenever new data reaches the cluster, it is stored in the secondary partition. When the Riak node which was down recovers, the data in the secondary partitions is transferred back to the primary partition of the original node.

Back to our incident. We noticed that many transfers had stalled; the data from the secondary partitions was not moving to the primary ones on the newly restarted nodes. To make matters worse, the transfer coordinator process had a large message queue. We kept on monitoring the system and gathering data whilst investigating the cause. When doing so, the overload issue resolved itself and the cluster returned to normal, serving requests as expected.

What caused the overload? In our investigations, we noticed a change in the configuration files. It turned out that before rebooting the Riak nodes, the operations team had changed the number of allowed concurrent partition transfers per node from the default setting of 2 to 50. They were hoping for a quicker transfer rate, but instead they overloaded the nodes, making them unresponsive. As icing on the cake, the extra traffic caused further problems by contributing to network saturation.
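
For reference, the setting in question is riak_core's handoff_concurrency, which defaults to 2. In a 1.4.x app.config, the change they made would have looked roughly like this (a sketch, not their exact file):

{riak_core, [
    %% Default is 2; raising it to 50 allows up to 50 simultaneous
    %% partition transfers per node, which in this case overloaded the cluster.
    {handoff_concurrency, 50}
]}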

Wouldn't it have been great if the unexpected config change had been immediately visible to us? We would have known the root cause straight away. Being able to check the messages of the stuck manager process, or to inspect the processes responsible for the stalled transfers, would have made us much more effective too; not to mention being able to see the complete, detailed history of what happened to the partitions instead of only a limited number of system snapshots.

Crisis

After the concurrent transfer setting was fixed, we reviewed the configuration of every node in the Riak cluster and found nothing suspicious. Even though the overload issue had resolved itself, we were still experiencing problems. Not a single bitcask merge (Riak's garbage collection mechanism[4]) had run to completion, resulting in increasing IO and memory utilisation. The transfers were still stalled, so on top of their usual load the nodes also had to handle the load generated by the secondary partitions. The extra overhead on all nodes resulted in constantly high load, a problem that had become the norm rather than the exception. It had to be addressed.

We decided to focus on the stalled transfers in the hope that resolving them would also resolve the load issues. When the Riak documentation and user mailing list don't help, you start reading the source code. We spent days trying to identify the parts which could be broken and cause the issue we were facing. Whenever we found a potential culprit, we asked the operations team to execute Erlang commands and give us feedback. Our progress was slow, as we were not in the same location as the client and did not have access to the production servers; all commands and troubleshooting had to go through the customer.

How much faster would our progress have been if the process info, the process messages and the process state had been immediately available to us? Delays matter here, as capturing important information is often possible only within a very limited window, so many times we simply missed the opportunity.

Incident report

While we were investigating the root cause, three of the six nodes crashed, causing a major outage. The reason for the crash was shown by dmesg, a command on most Unix-like operating systems that prints the message buffer of the kernel:

Out of memory: Kill process 9047 (beam.smp) score 985 or sacrifice child
Killed process 9047, UID 496, (beam.smp) total-vm:364260424kB, anon-rss:260285656kB, file-rss:12kB
possible SYN flooding on port 8087. Sending cookies.


We were able to restart the nodes and fully restore service in less than an hour, but the transfers were still stalled, and even more partitions became affected as additional transfers stalled.

What if we had more detailed records of memory metrics, such as atom memory, process memory, binary memory, ETS memory and system memory? These metrics would have given insight into the root cause, because we would have seen binary memory and process memory usage shoot up.

First aid

Analysing the logs and the metrics, it was clear that the issues were ongoing and that the nodes would soon crash again. Most noteworthy was an increase in memory consumption. We sent over an Erlang command that forced a garbage collection on all Erlang processes. The first attempt failed because the command was pasted incorrectly; the second attempt freed up half of the used memory! It was a quick win, allowing the operations team to bring traffic back to normal levels and giving us the breathing space we needed to resume our investigation of the stalled transfers.
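
The command in question was a one-liner along these lines, forcing a garbage collection of every process on the node:

%% Force a garbage collection on all processes of the local node.
[erlang:garbage_collect(Pid) || Pid <- erlang:processes()].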

How much faster would it have been if the logs had already been merged and we had been able to filter them effectively? We wouldn't have spent valuable time handling them. And how much safer would it have been if the correct Erlang commands had been readily available to the operators?

Having recovered from the outage, we returned to our main question.

Why are the transfers stalled?

We were studying a collection of output generated by riak-admin top -sort msg_q on the nodes. The output is a table where each row represents a process in the Riak node. The processes with the longest message queues are shown, sorted by the message queue length.

Pid                 Name or Initial Func         Time       Reds     Memory       MsgQ Current Function
---------------------------------------------------------------------------------------------------------
<6287.597.0>        proxy_riak_kv_vnode_54806     '-'    2889294   65919144      53856 gen:do_call/4
<6287.593.0>        proxy_riak_kv_vnode_45671     '-'    2889245   65893896      53067 gen:do_call/4
<6287.577.0>        proxy_riak_kv_vnode_91343     '-'    2889379   65851880      51754 gen:do_call/4
<6287.573.0>        proxy_riak_kv_vnode_0         '-'    2247316   52990840      51064 gen:do_call/4
<6287.581.0>        proxy_riak_kv_vnode_18268     '-'    2247417   52984728      50873 gen:do_call/4
<6287.601.0>        proxy_riak_kv_vnode_63940     '-'    2247281   52958968      50068 gen:do_call/4
<6287.243.0>        riak_core_vnode_manager       '-'   73592828    5020184      28179 riak_core_util:pmap_collect_one/1
<6287.632.0>        proxy_riak_kv_vnode_13473     '-'      16317      21744          5 gen:do_call/4
<6287.623.0>        proxy_riak_kv_vnode_11417     '-'        246      13816          4 gen:do_call/4
<6287.592.0>        proxy_riak_kv_vnode_43388     '-'      15405      13784          3 gen:do_call/4


A breakthrough in our investigation occurred when we noticed that the vnode proxy process message queue sizes didn't change. The role of these vnode proxy processes is to protect their vnode processes from overload. Each vnode process[3] manages a data partition, so every message addressed to a vnode process goes through the corresponding proxy process. The following drawing shows that in the normal flow of events the proxy simply forwards the messages (top image), but in our case the messages were stuck in the mailboxes of the proxy processes (bottom image):

[Diagram: normal flow, with the proxy forwarding messages to its vnode (top), versus messages piling up in the proxy's mailbox (bottom)]

Examining the source code, we found the names of these processes contained the id of the partition being managed by their vnodes. And guess which partitions they referred to? The ones that were stalled. Yay, so nice! But wait a minute. Why were these processes overloaded? The nodes handling the primary partitions were up, so the secondary partitions had no work to do. Only a very limited number of requests should have reached them. We sent a few Erlang commands to the operations team, asking them to retrieve the process info, the process messages, the process state of the proxies and the corresponding vnode processes. The result was a real surprise!

What we found was that the proxies had false assumptions about their vnodes. The vnodes had no messages in their message queues, but the proxies thought they were overloaded, so the overload protection kicked in, dropping any incoming requests, including the requests for completing the pending transfers. We had discovered a new riak_core issue[5] affecting all Riak versions between 1.4.12 and 2.1!

Details are in the GitHub ticket; the most important debugging results are reproduced here:

(riak@riak-06)1> erlang:process_info(Vnode, message_queue_len).
{message_queue_len,0} % Actually, the vnode message queue is 0.

(riak@riak-06)2> sys:get_status(VnodeProxy).
{status,
 <0.548.0>,
 {module,riak_core_vnode_proxy},
[[{'$ancestors',[riak_core_vnode_proxy_sup,riak_core_sup,<0.212.0>]},
  {'$initial_call',{riak_core_vnode_proxy,init,1}}],
  running,
  <0.221.0>,
  [],
  {state,
   riak_kv_vnode,
   182687704666362864775460604089535377456991567872,
   <0.28565.3961>,
   #Ref<0.0.12284.257228>,
   11495, % It believes its vnode message queue is 11495 long!
          % Thus the overload  protection is active.
   10000,
   1571,
   5000,
   2500,
   undefined}]}


We were pretty excited to have found the root cause of the problem, but the production cluster was still struggling, as the message queues were not getting any shorter. We couldn't expect the proxies to repair themselves quickly: the primary vnodes handling these partitions were up and serving all incoming requests, so no requests other than those for completing the transfers reached the faulty proxies, slowing their recovery. By studying the proxy implementation, we concluded that terminating these proxies would temporarily solve the problem, as their supervisor would restart them in a consistent state! So we wrote our killer script, which the operations team dutifully executed, and as a result the transfers completed.
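
The script itself boiled down to something like the following (a sketch, not the exact commands we sent): kill the registered vnode proxy processes and let riak_core_vnode_proxy_sup restart them with a clean state.

%% Kill every registered vnode proxy; the supervisor restarts each one
%% with an empty mailbox and a fresh, consistent overload counter.
[exit(whereis(Name), kill)
 || Name <- registered(),
    lists:prefix("proxy_riak_kv_vnode_", atom_to_list(Name))].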

How much easier would it have been to use a tool that provides etop-like process listings with an interactive menu, allowing users to kill processes directly and to query the process info and process state? Then we wouldn't have had to write complex Erlang commands to identify, inspect and manage the misbehaving processes.

What happened next? Getting the official patch installed solved the vnode proxy issue. It didn’t solve the resource usage issue, though that is a story for another blog post.

Conclusion

Summing up, operating production-grade Erlang systems is neither easy nor boring. An experienced support team equipped with application-specific knowledge and Erlang expertise is, on its own, simply not enough. You need an Erlang-centric view of the system, full visibility of the historical metrics and logs, and the drive to use this information to troubleshoot issues during and after outages. This Erlang-centric view allows you to quickly figure out what went wrong, fix it and put measures in place ensuring the problem never causes an outage again. Imagine how many person-hours we spent troubleshooting this particular issue! Not to mention the financial loss and reputation damage outages and service degradation can cause.

This incident, together with many others, inspired us to create a tool that gives us full visibility and early warnings when things are about to go wrong. It is called WombatOAM, and it has just reached version 2.2.1.

As is customary in the enterprise world, we were not granted access to the production servers; all communication went through the operations team, who typed commands into the Erlang shell and sent us the logs we requested. Had we had WombatOAM in this scenario, we could have accessed its GUI, allowing us to quickly troubleshoot the system (in read-only mode) and reduce the communication time with the operations team. We would have quickly spotted the increase in message queue sizes and, using WombatOAM's ETOP, could have inspected the process info, state and message queue. There is no doubt we would have found out within a couple of hours that the vnode proxy processes were blocked. WombatOAM would have further allowed the operations team, through a simple click of a mouse, to trigger the garbage collector on all the Erlang processes, without the risk of typing errors in the shell. The Riak-specific plugins would have provided us with fine-grained metrics and a detailed, complete history of the various events related to the partitions, enabling quick, periodic health checks.

During the outage, having a central view of all nodes' logs with filtering capabilities would have been a lifesaver, as the important logs could have been easily studied. Detailed memory metrics would have given us insight into the root cause of the Erlang nodes running out of memory. And thanks to WombatOAM's configuration management feature, we would have detected the misconfiguration immediately, as an alarm would have been raised automatically. The configuration issue could then have been fixed using the GUI, without having to access the remote shell or config files. Customers who give WombatOAM a chance to show its capabilities usually come back to us with success stories about how WombatOAM addressed their problems. Try it out and tell us yours!

References

[1] https://github.com/basho/riak

[2] http://docs.basho.com/riak/kv/2.1.4/using/reference/handoff/

[3] http://docs.basho.com/riak/kv/2.1.4/learn/glossary/#vnode

[4] http://docs.basho.com/riak/kv/2.1.4/setup/planning/backend/bitcask/#disk-usage-and-merging-settings

[5] https://github.com/basho/riak_core/issues/760

Permalink

Erlang/OTP 19.1 has been released


Some highlights of the release are:

  • erts: Improved dirty scheduler support. A purge of a module will not have to wait for completion of all ongoing dirty NIF calls.
  • erts: Improved accuracy of timeouts on MacOS X.
  • kernel: Add net_kernel:setopts/2 and net_kernel:getopts/2 to control options for distribution sockets in runtime.
  • asn1: Compiling multiple ASN.1 modules in the same directory with parallel make (make -j) should now be safe.
  • httpd: support for PUT and DELETE in mod_esi
  • ~30 contributions since 19.0

You can find the Release Notes with more detailed info at

http://www.erlang.org/download/otp_src_19.1.readme

You can download the full source distribution from http://www.erlang.org/download/otp_src_19.1.tar.gz

Note: To unpack the TAR archive you need a GNU TAR compatible program. For installation instructions please read the README that is part of the distribution.

You can also find the source code at github.com in the official Erlang repository. Git tag OTP-19.1
https://github.com/erlang/otp/tree/OTP-19.1

The Windows binary distributions can be downloaded from

http://www.erlang.org/download/otp_win32_19.1.exe

http://www.erlang.org/download/otp_win64_19.1.exe

You can also download the complete HTML documentation or the Unix manual files

http://www.erlang.org/download/otp_doc_html_19.1.tar.gz
http://www.erlang.org/download/otp_doc_man_19.1.tar.gz


You can also read the documentation on-line here (the release notes in the online documentation are not updated, but the new functionality is; see the Release Notes mentioned above):

http://www.erlang.org/doc/

We also want to thank those that sent us patches, suggestions and bug reports.

If you find bugs in Erlang/OTP report them via the public issue tracker at http://bugs.erlang.org

The Erlang/OTP Team at Ericsson

Permalink

Erlang & Elixir DevOps From The Trenches - Why we felt the need to formalize operational experience with the BEAM virtual machine


Let's backtrack to the late 90s, when I was working on the AXD301 switch, one of Erlang's early flagship products. A first line support engineer broke procedure and tried to fix a bug in a live system. They compiled the Erlang module, put it in the patches directory and loaded it on one of the nodes. The patch did not solve the problem, so they deleted the BEAM file, but were unaware that they had to load the old version again and purge the patched module. So the switch was still running the wrong version of the module. The issue was eventually escalated to third line support, where it took a colleague of mine forty hours just to figure out that the node was running a different version of the code than originally thought. All this time was wasted before they could even start troubleshooting the issue itself.

Year after year, we came across similar incidents, so I started asking myself how we could formalize these experiences in reusable code. Our aim was to ensure no one would ever have to spend 40 hours figuring out that a first line engineer had not followed procedure. At the same time, I wanted to make sure no one had to reinvent the wheel every time they started on a new project. This is how the idea for WombatOAM was born, a standalone Erlang node that acts as a generic operations and maintenance node (O&M for short) for Erlang clusters. Any system with requirements on high availability should follow a similar pattern and approach. Why not formalize it in reusable code?

Much of what we do here at Erlang Solutions is developing, supporting and maintaining systems. Every incident we ever experienced which could have been avoided by analyzing symptoms and taking action has been formalized. Take the AXD301 example. WombatOAM will generate a unique module identifier using the md5 digest of the code in every beam file, omitting information such as the compilation date and attributes which do not affect the execution. If two nodes running the same release have different md5 digests of the same module, we raise an alarm that alerts an operator. If a module is loaded or purged in the system, we log it. If something gets typed into the shell, we log it as well. So not only are we alerted that nodes running the same release have different versions of a module, we also have the audit trail which led to that state, for post-mortem debugging purposes.
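
To illustrate the idea, here is a minimal sketch (assumed module name, not WombatOAM's implementation) of comparing a module's digest across nodes; the md5 stored in a beam file covers only the significant chunks, so compile dates and attributes do not change it:

-module(module_md5_check).
-export([check/2]).

%% Compare the md5 digest of Module's loaded code across Nodes.
check(Module, Nodes) ->
    Digests = [{Node, rpc:call(Node, Module, module_info, [md5])}
               || Node <- Nodes],
    case lists:usort([Digest || {_Node, Digest} <- Digests]) of
        [_SameEverywhere] -> ok;                  %% all nodes agree
        _                 -> {mismatch, Digests}  %% raise an alarm here
    end.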

Every system that never stops needs mechanisms for collecting and suppressing alarms, monitoring logs, and generating and analyzing metrics. This is one or more subsystems collecting the functionality used for monitoring, pre-emptive support, support automation and post-mortem debugging. Applications such as exometer, folsom, elarm, eper, lager and recon will help, but only up to a point. In the telecom world, this functionality is put in a standalone node to minimize the impact on live traffic, both in terms of throughput and downtime. If the O&M node crashes or is taken offline, the system will still switch calls. This is the operations and maintenance node approach we believe should be adopted by other verticals, as high availability is today relevant to most server-side systems. Let's look at some stories from the trenches, and see how a proper O&M system would have reduced downtime and saved $$ in the form of person-months of troubleshooting effort and reduced hardware requirements.

Alarms

I've rarely seen the concept of alarms being properly used outside of telecoms. In some cases, threshold-based alarms are applied, but that is where it often stops. A threshold-based alarm is one where you gather a metric (such as memory consumption or requests per second) and raise an alarm if it reaches a certain upper or lower bound on the node where it is gathered. But the potential of alarms goes beyond monitoring thresholds in collected metrics. The concept is easy: if something that should not be happening is happening, an alarm is raised. When the issue reverts to normal, whether of its own accord, through automation (scripts triggered by the alarm) or human intervention, the alarm is cleared. Your database or network connectivity goes down? Raise an alarm and alert the operator as soon as the system detects it. Latency hits the limits of your SLA? Raise an alarm in your system the second it happens, not when the metrics are collected. Or a process message queue (among millions of processes) is growing faster than the process is able to consume the messages? Once again, raise the alarm. If the network or database link comes back up, latency becomes acceptable or the message queue is consumed, the active alarm is cleared.
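
To make the raise/clear idea concrete, here is a minimal sketch (assumed module and threshold, not WombatOAM code) of a message queue check built on the alarm_handler that ships with OTP's SASL application:

-module(msgq_alarm).
-export([check/1]).

-define(THRESHOLD, 10000).

%% Raise an alarm if Pid's message queue exceeds the threshold,
%% and clear it again once the queue has been consumed.
check(Pid) ->
    AlarmId = {long_message_queue, Pid},
    case erlang:process_info(Pid, message_queue_len) of
        {message_queue_len, Len} when Len > ?THRESHOLD ->
            alarm_handler:set_alarm({AlarmId, Len});
        _ ->
            alarm_handler:clear_alarm(AlarmId)
    end.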

Processes with long message queues are usually a warning of issues about to happen. They are easy to monitor, but are you doing it? We had a node crashing and restarting over a three-month period at a customer site. Some refactoring meant they were not handling the EXIT signal from the ports we were using to parse the XML. Yaws recycled processes, so every process ended up having a few thousand EXIT messages from previous requests that had to be traversed before a new request could be handled. About once a day, the node ran out of memory and was restarted by hand, and the mailboxes were cleared. Our customers complained that at times the system was slow. We blamed it on them using Windows NT, as we were not measuring latency. We occasionally saw the availability drop from 100% to 99.999% as a result of the external probes running their request right during the crash or while the node was restarting. This was rarely caught, as the external probes sent a request a minute that took half a second to process, whilst the node took 3 seconds to restart. So we blamed the glitch on operations messing with firewall configurations. With triple redundancy, it was only when operations happened to notice that one of the machines was running at 100% CPU that we got called in. Many requests going through the system, we thought, but the count was only 10 requests per second. Had we monitored the message queues, we would have picked up the issue immediately. Had we had notifications on nodes crashing, we would have picked the problem up after the event, and had we been monitoring memory usage, we would have seen the cause, leading us to look at the message queue metrics.

Or what about checking for file sanity? There was a software upgrade that required changes to the Erlang application environment variables. The DevOps team upgraded the sys.config file, copying and pasting from a Word document, invisible control characters included. Thank you, Microsoft! Months later, a power outage caused the nodes to be rebooted. But because of the corrupt sys.config file, the invisible control characters would not parse and the node would crash in the startup phase. Looking at the sys.config file did not make us any wiser, as the control characters were not visible. It took half a day to figure this one out. We now regularly check and parse all boot, app and config files (as well as any dependencies). If one of them is corrupted or has been changed manually in a way that might prevent the node from restarting, we raise an alarm.
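
The check itself is simple enough to sketch (assumed module name): parse the file with file:consult/1 long before the next restart needs it, and raise an alarm if it no longer parses.

-module(config_check).
-export([check/1]).

%% Returns ok if Path still parses as Erlang terms, otherwise a tuple
%% that can be turned into an alarm before the next restart needs the file.
check(Path) ->
    case file:consult(Path) of
        {ok, _Terms}    -> ok;
        {error, Reason} -> {corrupt, Path, Reason}
    end.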

I once did an ets:tab2list/1 call in the shell of a live system, and used the data to solve the problem. A few months later, when a similar issue arose, I instructed a first line support engineer to do the same, forgetting that the shell stores the results of its calls. We coped with two copies of the ets table (which happened to store subscriber data for millions of users), but the third caused the system to run out of memory (oh, had we only been monitoring memory utilization!). Today, WombatOAM monitors the actual memory utilisation of the shell process and raises an alarm should it become disproportionately large.

I could go on with war stories for every other alarm we have, describing how some of our customers, not having that visibility, caused unnecessary outages and call-outs. Every alarm we implemented in WombatOAM has a story behind it: either an outage, a crash, a call from operations, or hours of time wasted looking for a needle in a haystack because we did not have the necessary visibility and did not know where to look. We have 20 alarms in place right now, including checks on system limits (many of which are configurable, but you need to be aware you risk reaching them), sanity checks (such as corrupt files, clashing module versions, multiple versions) and the unusual shell history size alarm. I trust you get the point. Oh, and if you are using Nagios, PagerDuty or other alerting and notification services, WombatOAM allows you to push alarms to them.

Metrics

Metrics are there to create visibility, help troubleshoot issues, prevent failure and support post-mortem debugging. How many users out there with Erlang and Elixir in production are monitoring the BEAM's memory usage, and more specifically, how much memory is being allocated and used by processes, modules, the atom table, ets tables, the binary heap and the code server? How do you know a leak in the atom table is causing the node to run out of memory? Monitor its size. Or what if the cause is long message queues? You should see the used and allocated process memory increase, which leads you to the sum of all message queues. These metrics allow you to implement threshold-based alarms, alerting the DevOps team when the node has utilized 75% or 90% of its memory. They also allow you to figure out where the memory went after the node crashed and was restarted.
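
Most of that breakdown is available from the VM itself via erlang:memory/1 (values in bytes); a sketch of the kind of snapshot worth recording regularly:

-module(mem_metrics).
-export([snapshot/0]).

%% A per-node memory breakdown: total, process, atom, binary, ets,
%% code and system memory, as reported by the VM in bytes.
snapshot() ->
    [{Type, erlang:memory(Type)}
     || Type <- [total, processes, atom, binary, ets, code, system]].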

What about the time the distributed Erlang port hung and requests queued up, causing a net-split effect despite all network tests being successful and no alarm being raised? The distributed Erlang port busy counter would have quickly pointed us to the issue. It is incremented every time a process tries to send data over distributed Erlang while the port is busy with another job. We hooked up WombatOAM and realised we were getting three million port busy notifications per day! Distributed Erlang was not built to carry the peak load of this particular system. Our customer migrated from distributed Erlang to gen_rpc, and everything worked smoothly ever after.
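
Such events can be observed with the standard system monitor; a minimal sketch (assumed module name, much simpler than WombatOAM's counter) of logging busy_dist_port notifications:

-module(dist_busy_monitor).
-export([start/0, init/0]).

start() ->
    spawn(?MODULE, init, []).

init() ->
    %% Ask the VM for a message whenever a process is suspended
    %% because the distribution port is busy.
    erlang:system_monitor(self(), [busy_dist_port]),
    loop(0).

loop(Count) ->
    receive
        {monitor, Pid, busy_dist_port, Port} ->
            io:format("busy_dist_port ~p: ~p suspended on ~p~n",
                      [Count + 1, Pid, Port]),
            loop(Count + 1)
    end.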

We once spent three months soak testing a system that, contractually, had to run for 24 hours handling a sustained 15,000 operations per second. Each operation consisted on average of four HTTP requests, seven ets reads, three ets writes and about ten log entries to file, alongside all of the business logic. The system was running at an average of 40% CPU with about half a gig of memory left over on each node. After a random number of hours, nodes would crash without any warning, having run out of memory. None of the memory graphs we had showed any leaks. We were refreshing the data at 10-second intervals, showing about 400 MB of memory available in the last poll right before the crash. We suspected memory leaks in the VM, looked for runaway non-tail-recursive functions, reviewed all the code and ended up wasting a month before discovering that, seconds prior to the crash, there was a very rapid increase in the number of processes with unusually high memory spikes and heavy garbage collection activity. With this visibility, provided by the system trace, we narrowed the problem down to a particular operation which, when run on its own, caused a little spike in the memory usage graph. But when many of these operations randomly happened at the same time, the result was a monster wave that caused the VM to run out of memory. This particular issue took up to 20 hours to reproduce. It kept two people busy for a month trying to figure out what happened. When we finally knew what was going on, it took two weeks to fix it. Wombat today increments counters for processes spending too long garbage collecting, processes with unusually high memory spikes, and NIFs or BIFs hogging the scheduler, blocking other processes from executing and affecting the soft real-time properties of the system. If you suspect there is an issue, you can enable notifications and log the offending calls themselves.
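
Those counters map onto standard system monitor flags; the thresholds below are purely illustrative:

%% long_gc: GC pauses longer than 100 ms; large_heap: heaps growing beyond
%% roughly 10 million words; long_schedule: a NIF/BIF hogging a scheduler
%% for more than 50 ms.
erlang:system_monitor(self(), [{long_gc, 100},
                               {large_heap, 10000000},
                               {long_schedule, 50}]).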

Already using folsom, so you do not need WombatOAM, I hear? Think again. Out of the box, folsom will give you a dozen metrics. Use folsom or exometer for your business metrics, measuring application-specific values such as failed and successful requests, throughput and latency. And let Wombat gather over a hundred VM-specific metrics, with a hundred more depending on which applications you are running in your stack. It will take anyone a few days (or hours if you are a hero programmer) to implement a few instrumentation functions to collect metrics from Cowboy, RabbitMQ, Riak Core or Mnesia. It will, however, take weeks (sometimes months) to figure out which metrics you actually need and to optimize the calls to reduce CPU overhead, ensuring there is no impact on the throughput of the system you are monitoring. When looking for a needle in a haystack after a crash, you never know which metric you are going to need until after the event. If we've missed it, be sure we're going to add it to WombatOAM as soon as one of our customers points it out. So you are not only getting the benefit of our experience, but also that of all our other customers.

Notifications

Notifications are log entries recording a state change. They help with troubleshooting and post-mortem debugging. How many times have you had to investigate a fault without knowing whether any configuration changes had been made? In the AXD301 example, we would have logged all modules loaded in the nodes, and any commands executed in the Erlang shell.

In our first release of WombatOAM, we used to store information on all system notifications, including the distributed Erlang port busy notifications, processes with unusually high memory spikes and long garbage collection pauses. Until a customer started complaining that they were running out of disk space. It was the very same customer who was getting three million port busy messages a day. System notifications are today turned off by default, but as soon as you detect you have a problem, you can turn them on and figure out which process has unusually high memory spikes or which function call is sending data to the distributed Erlang port. High volumes of logs, especially business logs, need to be monitored for size. As every system is different, you will have to configure and fine-tune what is pushed.

We once had (well, we still do, they are now running WombatOAM) a customer with several hundred nodes in production, running over thirty different services. The only way for them to notice that a node had been restarted was to log on to the server and look for the crash dump file. The only way to know if a process had crashed was to log on to the machine, connect to the shell, start the report browser and search for crash reports. Minor problem: this was an enterprise customer, and developers were not given access to the production machines. Wombat will automatically collect node crash notifications (and rename the crash dump file), as well as pull in error, warning and crash notifications. They are all in one place, so you can browse and search all node restarts and crash reports (and twenty other standard notifications, alongside application-specific ones). We handle system logs out of the box, but if you are using logging applications such as the SASL logger, Lager and the Elixir Logger, you can pull in your business logs and use them to prove your innocence or admit guilt from one location. And if that one location is not WombatOAM, WombatOAM can push them to LogStash, Splunk, DataDog, Graylog, Zabbix or any other log aggregation service you might be using.
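
The manual routine described above amounts to roughly this on the node itself, using SASL's report browser rb (assuming its disk logging is configured):

1> rb:start([{type, crash_report}]).
2> rb:list().   %% list the stored crash reports
3> rb:show(1).  %% inspect the first one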

But it is only Erlang Code!

Yes, someone who was trialling WombatOAM actually did say this to us, and suggested they implement everything themselves! You can do all of WombatOAM yourself, adding items after outages or as you gain more experience of running BEAM clusters. Figure out what metrics, alarms and notifications to generate, and add them for SASL, Lager, Elixir Logger, Folsom, Exometer, Mnesia, OSmon, Cowboy, Phoenix, Poolboy, Riak KV, Riak Core, Yokozuna, Riak Multi-Datacenter Replication, Ecto, RabbitMQ and MongooseIM (all this at the time of writing; more are being added each month). After all, as they said, it is just code. When done, you can spend your time testing and optimizing your instrumentation functions, chasing BIFs which cause lock contention, and optimizing your nodes to reduce memory utilization and CPU overhead. And when you are done with that, you can start working on the tooling side, adding the ability to check and update application environment variables, run etop-like tools, or execute expressions from the browser. Or implement northbound interfaces towards other O&M tools and SaaS providers. Or implement the infrastructure to create your own plugins, optimize connectivity between WombatOAM and the managed nodes, optimize your code to handle hundreds of nodes in a single WombatOAM node, or put in place an architecture and test that it will scale to 20,000 nodes. Honestly? If you want to reinvent the wheel, don't bother. Just get WombatOAM or come work for us instead! (Shameless plug: the WombatOAM team is recruiting.)

Wrapping Up

Let's wrap up with a story from one of our own teams, which pretty much captured the essence of Wombat in an area I had not originally thought of. You know a tool is successful when others start using it in ways you had never thought of. Our original focus was operations and DevOps teams, but developers and testers have since used WombatOAM for visibility and quick turnaround of bugs.

Our MongooseIM team had to stress test and optimize a chat server running on four nodes, with about 1.5 million simultaneously connected users. As soon as they started generating load, they got the long message queue alarm on one of the nodes, followed by the ets table system limit alarm.

Based on this, they figured out that Mnesia was the bottleneck, and more specifically the login operation, which uses transactions (hence the ETS tables). Investigating further, they discovered from the graphs that one of the nodes was carrying twice the load of the other three. They looked at the schema and saw that the tables were replicated on only three of the four nodes.

How did the above misconfiguration happen? They looked at the notifications and discovered a logged shell command in which the table schema had been created using nodes() instead of [node()|nodes()], missing out the node on which the command had been run.
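
The gist of that shell command (table name and storage type here are illustrative): nodes() returns only the other connected nodes, so the local node was left out of the schema.

%% What was typed, replicating the table on only three of the four nodes:
mnesia:create_table(session, [{ram_copies, nodes()}]).

%% What was intended, including the node the command was run on:
mnesia:create_table(session, [{ram_copies, [node() | nodes()]}]).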

They found and solved the problem in 30 minutes, only to have to jump on to the next one, which was an unusually high memory spike occurring during a particular operation. Followed by issues with the versioning of the MongoDB driver (Don’t ask!). Similar issues, when they did not have the same level of visibility, could have taken days to resolve.

In Conclusion

It has long been known that using Erlang, you can achieve reliability and availability. But there seems to be a common myth that it all happens magically out of the box: you use Erlang, therefore your system will never fail. Sadly, there is no such thing as a free lunch. If you do not want to kiss your five nines goodbye, the hard work starts when designing your system. Monitoring, pre-emptive support, support automation and post-mortem debugging are not things you can easily bolt on later. By monitoring, we mean having full visibility into what is going on. By pre-emptive support, we mean the ability to react to this visibility and prevent failure. Support automation allows you to react to external events, reducing service disruption by resolving problems before they escalate. And post-mortem debugging is all about quickly and efficiently detecting what caused a failure without having to stare at a screen hoping the anomaly or crash experienced in the middle of the night happens again whilst you are watching. Whilst five nines will not happen magically, Erlang takes care of a lot of the accidental difficulty, allowing you to achieve high levels of availability at a fraction of the effort compared to other programming languages. WombatOAM will take care of the rest.

I think the above is the beginning of the end of a blog post, and if you are still reading, trying WombatOAM out is probably worth more than another demo or walk-through. You never know when you are going to need the information and visibility until it is too late, and unless you are troubleshooting issues, the more you see the better.

If you want to read more about systems which never stop, I recommend chapter 16 in my book Designing for Scalability with Erlang/OTP, which covers monitoring and pre-emptive support. You can find it on Safari, BitTorrent or buy it on Amazon or the O’Reilly site. If you use the O’Reilly site, discount code authd gives you 50% off the digital copy and 40% off the printed one. And of course, there is also Erlang in Anger, by Fred Hebert.

Permalink

Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.