Writing, Editing and Formatting a Technical Ebook in 2021

This post could easily be 20,000 words… there is just so much shit to wade through to get your book looking just so. What do I mean? Here are my concerns, which I think are simple ones. I want:

  • Fonts to be crisp and readable, the line-heights appealing to the eye and perhaps a few flourishes here and there like a drop-cap or small-capped chapter intro
  • Images to lay out with proper spacing, a title for accessibility, and center alignment
  • Code samples to be formatted with syntax highlighting
  • Callouts for … callouts. Like a colored box with a title maybe.

Things like this are easy for word processing tools like Word or Pages but when it comes to digital publishing, not a chance.

I could fill up paragraphs but I won’t. I’ll just summarize by saying: formatting ebooks is a massive pain. Thankfully I’ve figured a few things out.

Have You Tried?

Yes, I’m 99% sure I have. I’ve used:

  • Scrivener (and I love it). Great for writing, crap for formatting.
  • Markdown turned into HTML and then formatted with Calibre. Works great - I did this with Take Off with Elixir and it looked amazing. Tweaking Calibre and hand-editing the CSS wasn’t so fun, though.
  • Michael Hartl’s Softcover, which was stellar at formatting and looked great, but the styling choices were lacking. There were ways to edit things, but I’m not a LaTeX person. Installation was complex but doable… overall I enjoyed this one the most.
  • A zillion others, including iA Writer, Bear, Ulysses, and Pages/iBooks Author, plus many more I’m forgetting.

I’ve written 5 books over the last 6 or so years and I’m currently writing 2 more (which I’ll go into in a second). I swear to you I’ve tried just about everything.

When I wrote A Curious Moon I just went back to Word and decided to break the writing process up into two steps: actual writing and then formatting. I knew this is what a lot of writers did, but my process of writing (constantly evolving, “living” ebooks) didn’t lend itself to this. No matter - I can adapt.

It worked out pretty well, too. I wrote everything in Word and then hired an editor. Once editing was done I ported it to InDesign and spent weeks (literally) learning this massive tool.

It was worth it, I think - the book looks amazing…

The problem is that it’s laborious and it killed my inspiration. Making edits, for instance, means using InDesign’s ultra-crap text editor to fix things, which isn’t fun.

Things get really exciting when InDesign bumps versions and everything is completely wrecked because they changed the way images are resolved (which happened)…

OK, enough complaining! Here’s where I’m at today, with a process I really like.

Writing in 2021

In my last post I mentioned that I was writing a book with Troy Hunt. It’s a fun project in which I’m:

  • Curating posts from his blog that I find interesting
  • Curating comments from those posts and adding them at the end (anonymously)
  • Pushing Troy to write retrospectives on these posts, giving some kind of backstory

Once again I decided to use Word and I wonder if that was the right decision. My thinking at the time was that Troy is a Windows person and I could use OneDrive to share the document with him so he could make comments.

There are problems with this, which include:

  • The document is huge. We’re up to 800+ pages with quite a lot of images. Syncing this thing in real time is kind of a bugger.
  • The formatting looks horrendous and trying to explain to Troy “yeah, nah mate I’ll make it look good” is a bit hard. Especially when he replies “I already did that… have you seen my blog?” and I reply “yeah… ummm…”
  • Troy writes in Australian and uses words like “flavour”, “favourite” and “whilst”. Word’s spell checker doesn’t like that and YES I’ve reset it to Aussie English but it doesn’t seem to make a difference. Red squiggles are everywhere!

These are interesting problems to solve! For Troy, his blog is a done thing so my formatting woes are a bit ridiculous. I completely understand this, and I think I blew it by pulling things into Word first.

Writing From Scratch: Ulysses

This might come as a shock, but I find Ulysses to be, hands-down, the best writing experience I’ve ever had. In 2018 my choice was Scrivener:

When it comes to assembling your thoughts and structuring your manuscript, there is nothing that beats Scrivener. The functionality it gives you is astounding, and you can write little snippets or big hunks - it’s all up to you.
It successfully detaches the idea of text from presentation. You can kick up a manuscript and then compile it for various outputs such as epub, mobi, docx, and pdf. The compiler takes time to get used to, but once you do you can have a serviceable book.

This is still true, the keyword being “serviceable”. Writing in Scrivener is an engineer’s dream, as it focuses so completely on the process of writing. The aesthetics of it, however, suck, and you end up with something… serviceable:

By “serviceable” I mean the text will show on screen, as will the images, and if you’re lucky maybe some fonts will show up. I played with the compiler for days (literally), trying to get my epub layout to flow the way I wanted. Line height, title leading, first-paragraph non-indent… all of this is tricky or non-existent in Scrivener.

Ulysses, on the other hand, is pure markdown:

When you come from Word and Scrivener, this is amazing. I can’t tell you how many times I have to drop into “invisibles” to correct some janky weird formatting/layout issue in both Word and Scrivener. Things get so utterly ugly that I have to stop writing to fix it, which really makes the writing process suck.

With Ulysses, however, I just write in Markdown and I’m a happy person. It’s more than just a Markdown editor - it’s also a writer’s friend with some amazing support tools. The one I really like is the “Review” feature, which sends your current page to LanguageTool, takes the feedback (which is free), and shows you where corrections are suggested. There are also ways to annotate your text and leave yourself comments, which I also love.

When I’m ready to preview how things will look, there’s a preview button right there that shows my Markdown directly as an EPUB. This is fine if you’re OK with producing something that’s… “serviceable”. But that’s not what I want for Troy’s book.

Formatting for 2021

There are two apps that will format a book to near pixel-perfection for EPUB and PDF, don’t come from Adobe, and don’t require a master’s degree:

  • Apple Pages (which has absorbed iBooks Author)
  • Vellum

Apple Pages will open up a Word document, read the styling, and immediately show you a 99% perfect recreation of your Word document. From that point you can start polishing things up and off you go. It really is a great bit of software, but, oddly, it sucks for writing. Not quite sure why.

The winner for me is Vellum. Check this out:

I was able to send my text directly from Ulysses into Vellum and this is the formatting I saw. It is perfect. One downside with Vellum is that it doesn’t support syntax highlighting, which was one of my requirements :(. Another is that the export themes are only slightly customizable, but that’s about it. I don’t mind - it keeps me from going nuts.

I’m OK with doing screenshots for code and then making sure the reader has a link to a GitHub repo with code in it - that’s what I did for A Curious Moon and it worked out fine. It also looks better and, let’s be honest, no one is going to copy/paste code from an ebook - it’s horrible!

The features in Vellum are tremendous and it’s great for formatting a high-end ebook. It can’t do everything InDesign does, because InDesign is all about document design. But I kind of like that - Vellum focuses on wonderful book formatting, putting the attention on your words and content, no more.

I think it will work really well for Troy’s book - but you tell me.

Live Streaming the Formatting Process

I’ll be live streaming the formatting process with Troy on Monday, April 12th at 2PM PDT. We’ll have a few chapters of our book open in Word and I’m going to pull them into Vellum to see what he thinks. If he doesn’t like it, I’ll pull the book into Apple Pages and see what I can put together there.

If he doesn’t like that I’ll have to go back over to InDesign which isn’t the worst thing in the world, but it’s detailed enough that we’ll have to do another full stream.

We’ll also be discussing titles and cover ideas - so please join us! I would love to hear your thoughts. If you want to be updated on our progress I created a mailing list and we’ll be sending out updates when they happen… maybe twice a week.


Erlang: Socket experiments preliminary to writing a web server from scratch

A relative newcomer to networking in Erlang, Dr. Ajay Kumar, has started a self-educational project to create a web server from scratch in Erlang, to give himself some first-hand insight into how TCP sockets work in Erlang and how web servers work in general. Web servers are a thing almost everyone has written against or for, but few have tried to implement on their own, because socket programming sounds scary and time-consuming to learn.

This video is quite short but incidentally demonstrates how not scary socket programming is and how easy it is to experiment with networking concepts on your own. Go experiment! Write networky things! It’s fun!
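To make that concrete, here is a minimal sketch of the kind of experiment the video encourages: a tiny TCP echo server. It is not code from the video; it is written in Elixir syntax but uses the same Erlang :gen_tcp primitives a from-scratch Erlang web server would build on.

```elixir
# Minimal TCP echo server (illustrative sketch, not code from the video).
defmodule EchoServer do
  def start(port) do
    # Listen for binary packets; active: false means we read explicitly with recv/2.
    {:ok, listener} = :gen_tcp.listen(port, [:binary, active: false, reuseaddr: true])
    accept_loop(listener)
  end

  defp accept_loop(listener) do
    {:ok, socket} = :gen_tcp.accept(listener)
    # One lightweight process per connection.
    spawn(fn -> serve(socket) end)
    accept_loop(listener)
  end

  defp serve(socket) do
    case :gen_tcp.recv(socket, 0) do
      {:ok, data} ->
        :gen_tcp.send(socket, data)  # echo the bytes straight back
        serve(socket)
      {:error, :closed} ->
        :ok
    end
  end
end
```

Point telnet or nc at the port and whatever you type comes back; swapping the echo for an HTTP request parser is essentially step one of a from-scratch web server.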

As an aside to this… I have a two-part explanation video that explains everything going on inside the service he is basing his web server on. In the first part I explain what is going on within the default chat server that ZX templates as a network service project (kind of slow; covers basics for Erlang and ZX newcomers), and in the second part I explain how I used that as a basis for creating a telnet chat service that implements global shouts, channels, permissions, and other basic features needed for a real chat service (a bit better paced if you already know your way around Erlang; discusses some higher-level concepts such as service structure and the “service -> worker” pattern).


Marketing and sales intelligence with Elixir at PepsiCo

Welcome to our series of case studies about companies using Elixir in production. See all cases we have published so far.

PepsiCo is one of the world’s leading food and beverage companies serving more than 200 countries and territories around the world. Today Elixir is used at varying capacities inside PepsiCo by six different teams. This article explores how the Search Marketing and Sales Intelligence Platform teams adopted and use Elixir to build internal tools.

Although we will explore only two teams in this article, PepsiCo is hiring Elixir engineers across multiple teams. Let’s get started.


The first steps

The project that would become the first Elixir project and open the door for future Elixir applications inside PepsiCo was started by Jason Fertel back in 2016.

Initially, the application provided workflow automation for managing search marketing operations on multiple web platforms. The product was a success and ended up integrated into PepsiCo in 2018.

Now, the Elixir application plays a central role in a data pipeline that empowers PepsiCo’s marketing and sales teams with tools to query, analyze, and integrate with several search marketing partners.

The pipeline starts with the Data Engineering team, which collects and stores data in the Snowflake Data Cloud. The Elixir application reads data from Snowflake’s platform, pre-processes it, and stores it in one of two databases, PostgreSQL or Apache Druid, according to the data characteristics. Finally, a Phoenix application serves this data to internal teams and communicates directly with third-party APIs.

Why Elixir?

Elixir helps PepsiCo eCommerce focus and get things done fast. “Elixir allows our team to develop quickly with confidence,” says David Antaramian, a Software Engineering Manager at PepsiCo. “In turn, that lets us deliver value to the business quickly, and it’s the reason we’ve stuck with the language. Whether it’s streaming changes to the front-end or orchestrating concurrent data operations across multiple storage systems, Elixir offers a robust developer experience that translates to a great consumer experience.”

Different Elixir features came together to help the PepsiCo team build compelling development and user experiences. Thanks to its functional and extensible aspects, PepsiCo used Elixir to create a domain-specific language that translates business queries into data structures sent to different stores. This gave them a stable foundation where they can continually add new queries and integrations, even as they grow in complexity.
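As an illustration of what such a translation layer can look like (the module, field, and backend shapes below are invented for this sketch, not PepsiCo's actual code), one business-level query struct can be compiled into whatever each store expects:

```elixir
# Hypothetical sketch of a query DSL in the spirit described above:
# one business-level query, translated per backend. All names are invented.
defmodule Query do
  defstruct [:metric, :filters]

  # Translate to a SQL string for PostgreSQL.
  def to_sql(%Query{metric: m, filters: f}) do
    where = Enum.map_join(f, " AND ", fn {k, v} -> "#{k} = '#{v}'" end)
    "SELECT SUM(#{m}) FROM sales WHERE #{where}"
  end

  # Translate to a map shaped like Druid's native JSON query format.
  def to_druid(%Query{metric: m, filters: f}) do
    %{
      queryType: "timeseries",
      aggregations: [%{type: "doubleSum", fieldName: to_string(m)}],
      filter: %{
        type: "and",
        fields:
          Enum.map(f, fn {k, v} ->
            %{type: "selector", dimension: to_string(k), value: v}
          end)
      }
    }
  end
end
```

The appeal of the functional style here is that supporting a new store is just one more pure translation function over the same data structure.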

Furthermore, the reports generated by PepsiCo’s marketing and sales teams often have to query different tables or even separate storages, all while juggling long-running connections to different third-party APIs. Elixir’s programming model, inherited from the Erlang Virtual Machine, makes it trivial to run all of these operations concurrently, leading to fast and rich user interactions while the development team remains focused on delivering features.
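The concurrency pattern described here can be sketched in a few lines. This is an illustrative stand-in (the backend names and fake results are invented), not the team's code:

```elixir
# Fan a report's queries out to different backends concurrently
# and collect the results in order. Backends are simulated stubs.
defmodule Report do
  def fetch_all(queries) do
    queries
    |> Task.async_stream(fn {backend, q} -> {backend, run(backend, q)} end,
                         max_concurrency: 10, timeout: 15_000)
    |> Enum.map(fn {:ok, result} -> result end)
  end

  # Stand-ins for real PostgreSQL / Druid / partner-API calls.
  defp run(:postgres, q), do: {:rows, q}
  defp run(:druid, q), do: {:rows, q}
  defp run(:partner_api, q), do: {:json, q}
end
```

Each query runs in its own lightweight process, so a slow third-party API call does not hold up the database reads.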

Libraries and frameworks

David Antaramian is quick to praise the Erlang runtime and its standard library. He says: “Since we are working with large amounts of data, it is also essential to avoid hitting the database whenever possible. Thankfully, Erlang ships with an in-memory table storage called ETS, which we use to store hundreds of thousands of rows”.
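For readers unfamiliar with ETS, the pattern David describes looks roughly like this (a minimal sketch; the table name and rows are made up):

```elixir
# An in-memory ETS table used as a read-through cache of database rows.
table = :ets.new(:row_cache, [:set, :public, read_concurrency: true])

# Insert a couple of "rows" keyed by id; in practice these would be
# loaded from the database once instead of queried repeatedly.
:ets.insert(table, {1, %{product: "cola", units: 120}})
:ets.insert(table, {2, %{product: "chips", units: 80}})

# Lookups by key are constant-time reads with no database round-trip.
[{1, row}] = :ets.lookup(table, 1)
IO.inspect(row.units)  # => 120
```

ETS ships with the Erlang runtime, so this kind of cache needs no extra dependency or external service.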

The Erlang standard library was also handy when communicating with some data stores. In particular, the Snowflake platform requires ODBC connections. The PepsiCo team built a library called Snowflex, designed on top of Erlang’s built-in ODBC drivers.

The Elixir ecosystem nicely complements the Erlang one. The front-end, written in React, talks to the server via the Absinthe GraphQL toolkit running on top of the Phoenix web framework. The Ecto database library manages the communication to PostgreSQL. They also use the esaml and Samly libraries to provide authentication within PepsiCo’s organization - another example of leveraging the tools within both Erlang and Elixir communities.

Finally, the team also recognizes the efforts of the Erlang Ecosystem Foundation, of which PepsiCo is a sponsor, particularly the Observability Working Group. David remarks: “The adoption of Telemetry by the ecosystem has been a massive help in bringing monitoring visibility and metrics to our system. Now when we see spikes in one place, we can easily correlate them with other system parts”.


Today there are more than 40 Elixir engineers within PepsiCo, distributed across six teams. Eight of those engineers are part of the Search Marketing and Sales Intelligence Platform teams.

While the team recognizes that there aren’t as many Elixir engineers as in communities like JavaScript’s, they have been effective in hiring qualified Elixir candidates. Chase Gilliam, a Software Engineering Manager at PepsiCo, explains: “We have met many engineers that, like us, found Elixir due to being burned out by previous experiences. So when it came to hiring, many Elixir candidates had a mindset similar to ours, which ultimately sped up the process.”

This initial group of Elixir engineers paved the way for the language’s growth inside PepsiCo. David adds: “At first, we looked for engineers with Elixir experience to help us build a team that could guide other developers. Then we extended the pool to anyone who has a functional programming background and now to developers with either Ruby or Erlang experience. However, if someone is the right candidate, we onboard them even if they have no Elixir experience and train them”. He continues: “We also make extensive use of the learning resources available in the community, such as conferences, books, online courses, and others.”

As the team grew, they adopted best practices and saw the quality of the codebase improve at the same time. Chase concludes: “At the beginning, there were some large modules in our application. Luckily, refactoring in a functional programming language is straightforward, thanks to immutability and limited side-effects. Adopting tools like Credo, ExDoc, and the code formatter was also essential to standardize how we use Elixir internally.” For those interested in learning more about the different use cases for Elixir inside PepsiCo and helping continue its growth: they are hiring.


Books, YouTube, and Twitch

Books, YouTube, and Twitch. These are the projects I’m putting the most effort into during my free time, publishing books such as the English translation of the Erlang/OTP book and the new book on the Phoenix Framework. But what else is there on YouTube and Twitch?


OTP 24.0 Release Candidate 2


This is the second of three planned release candidates before the OTP 24 release.
The intention with this release is to get feedback from our users. All feedback is welcome, even if it is only to say that it works for you.

We encourage users to try it out and give us feedback either by creating an issue here https://github.com/erlang/otp/issues
or by posting to the mailing list erlang-questions@erlang.org.

Erlang/OTP 24 is a new major release with new features and improvements, as well as a few incompatibilities. Some of the new features are highlighted below.


Highlights rc2


  • The compiler will now inline funs that are used only once immediately after their definition.

erts, kernel, stdlib

  • Hex encoding and decoding functions have been added to the binary module.
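In Elixir syntax (an illustrative snippet; requires an OTP 24+ runtime), the new helpers round-trip like this:

```elixir
# OTP 24 adds binary:encode_hex/1 and binary:decode_hex/1 to the binary module.
# encode_hex/1 returns uppercase hex; decode_hex/1 reverses it.
hex = :binary.encode_hex("OTP 24")
# "OTP 24" is 4F 54 50 20 32 34 in ASCII, so hex == "4F5450203234"
original = :binary.decode_hex(hex)
IO.inspect(original == "OTP 24")  # => true
```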

As usual, there are a number of bug fixes and improvements, detailed in the readme.



Highlights rc1

erts, kernel, stdlib

  • The BeamAsm JIT-compiler has been added to Erlang/OTP and will give a significant performance boost for many applications.
    The JIT-compiler is enabled by default on most x86 64-bit platforms that have a C++ compiler that can compile C++17.
    To verify that a JIT enabled emulator is running you can use erlang:system_info(emu_flavor).

  • A compatibility adaptor for gen_tcp to use the new socket API has been implemented (gen_tcp_socket).

  • Extended error information for failing BIF calls, as proposed in EEP 54, has been implemented.

  • Process aliases, as outlined in EEP 53, have been introduced.


compiler

  • Compiler warnings and errors now include column numbers in addition to line numbers.
  • Variables bound between the keywords 'try' and 'of' can now be used in the clauses following the 'of' keyword (that is, in the success case when no exception was raised).


ftp

  • Added support for FTPES (explicit FTP over TLS).


ssl

  • Support for the “early data” feature for TLS 1.3 servers and clients.
  • TLS handshakes in Erlang distribution are now concurrent.


wx

  • The application has been completely rewritten in order to use wxWidgets version 3 as its base.
  • Added support for wxWebView.


edoc

  • EDoc is now capable of emitting EEP-48 doc chunks. This means that, with some configuration, community projects can now provide documentation for shell_docs the same way that OTP libraries have since OTP 23.0.


For more details about new features and potential incompatibilities see

Pre-built versions for Windows can be fetched here:

Online documentation can be browsed here:

The Erlang/OTP source can also be found at GitHub on the official Erlang repository:


OTP 23.3 Release


Erlang/OTP 23.3 is the third and final maintenance patch package for OTP 23, with mostly bug fixes as well as a few improvements.

A full list of bug fixes and improvements can be found in the readme.

Download and documentation

Online documentation can be browsed here:

Pre-built versions for Windows can be fetched here:

The Erlang/OTP source can also be found at GitHub on the official Erlang repository:


Surfacing Required Knowledge



When I first started this blog, I had it in mind that I'd only write about things that were discrete and verifiable. I wanted to avoid the bullshit calls of being an Intellectual in Silicon Valley, or, say, casting entire industries through one reductive lens (coming to mind is the old post from Steve Yegge, on the now-defunct Google+, framing software engineering as political affiliation), or, god forbid, writing a blog post about "how to be a good <role>", which is transparently a text that just says "here's how to be more like me."

Unfortunately, this text is about to break a crapton of these early rules. If at any point you feel "gosh, this sounds like SV-style thought leadership", please feel free to kneecap me on Twitter to get me back into a reasonable place.

I've learned that it's sometimes useful to take a specific lens under which we can analyze dynamics: see where the idea pushes us and how it surfaces new perspectives, some of them useful and worth keeping around. The social sciences do that kind of work under ideas like gender or race studies, for example, and semiotics does it around signs and how they communicate information to interpreters. Other lenses could include things such as studying social graphs and communication patterns, or advice like "follow the money."

In this post, I'm taking such a lens and trying to apply it to my experience in the tech industry to see how well it explains a few things. My perspective is going to be limited, and this post is de facto going to contain a lot of bullshit that will not ring true to readers. I will try to be careful to keep this as a way to consider alternative viewpoints and add perspectives, rather than as a new framework that ignores the existing richness of things to provide a blunt high-level metric that causes more damage than anything. It's more than likely that all of this stuff already exists and is better studied than what I came up with, in a discipline that has existed longer than I've been alive but that I'm unfortunately unaware of. There's also a bunch of it that's just me digesting ideas from smarter people and re-wording them clumsily.

Also final words of caution: I am going to arbitrarily use terms like "knowledge", "expertise", "experience", and "skills", and do so loosely and interchangeably. I am also going to do the same thing with terms like "education", "training", and "teaching". I am using these terms to refer to a general concept of attributes we believe people need to perform up to our expectations and the means by which we can create or augment these attributes in people.

Externalizing Required Knowledge

One of the good presentations I've seen people refer to a lot has been the one at boringtechnology.club, Choose Boring Technology. The article always made a lot of sense to me, but also annoyed me because I like Erlang a lot, and Erlang is not boring technology. It is a non-commoditized ecosystem where you can't easily reach out and grab 50 senior devs for market prices. On the other hand, I have also seen how trying to do polyglot organizations right can highlight a lot of blind spots that exist within an organization that feels safe using a single language. I've also seen cases where people stretched PostgreSQL in ways that were extremely uncomfortable, and where the same kind of wizardry was required for a mainstream tool as a niche one, and you couldn't easily hire 50 senior devs at market prices for the skill set required either.

Companies often choose technologies hoping to easily find a workforce for them, one that is commoditized. Put up the job ad, get seniors with 5+ years of experience on 3-year-old technology. Nobody knows where or how they were trained. They're just there, ripe for the picking, part of the environment. Countless companies would rather spend extra months looking for the proper type of seniors with the proper ability to answer the right whiteboard questions—which mostly suck at predicting how good they'll be on the job—than spend that time training their employees up to the level they expect.

On the other end of the relationship, workers enter a perpetual rush to stay up to date on specific technologies, with the hope of remaining employable. People are stuck playing a divination game, hoping to bet on the right stack and skill set so that they'll check all the boxes in a job ad that reads like a Christmas wishlist in a letter to Recruiter Santa. Where do developers find the time to keep up to date? Free time, mostly. They do it at night, or maybe on time stolen from employers. A few are lucky enough to pick it up as they go, paid for it as part of their main duty, usually because their job accepts that they'll be slower while they figure shit out with a book they've been given to help them along the way.

This cycle, in which companies keep building on newer stacks to have a more easily hireable workforce and workers keep learning newer stacks to remain employable, accelerates in a painful way, churning through burnouts and stacks. People avoid roles in stacks seen as less trendy, fearing they will hold them back, and yesterday's hip new thing becomes the radioactive legacy of tomorrow.

The ever-growing wage inflation in tech, which has been going on for way longer than I'd have expected, likely does not help either. Average tenure at most organizations remains short, as people know and expect to get far more significant raises from switching roles than from waiting for internal promotion or salary adjustment cycles. Hell, it's likely hard for many to get an income adjustment that matches the rate of inflation while keeping their job, but easy to get wage increases as high as 30%-70% by switching roles. This is compounded when employers have salary ranges attached to seniority ladders: to keep hiring new talent at market rates without raising existing payroll costs, people with less experience get hired at more senior levels while more senior people's levelling slows down. Hopping jobs becomes a baseline strategy in the industry, and people who are happy in their roles can find themselves at a severe disadvantage in terms of opportunity for sticking around too long.

It feels unsustainable. It wasn't always this way, and it still does not need to be. I can't comment on the feasibility, for most employers, of more aggressively giving raises to current staff to keep up with market rates in order to retain talent. On the other hand, there are strategies that can be adopted to better cope with an expected churn in your workforce. These strategies are useful at all sizes: large corporations with significant workforces and complex hiring efforts, and growing startups whose history is held in mind by an ever-shrinking faction of veterans.

On-the-job training, where an employer takes on the duty of training the people they hire, used to be one of the most popular ways of doing things, usually with the help of mentorship and apprenticeship. Companies that historically were truly innovative actually had no choice but to work this way. If they were absolute leaders, they were the only ones able to show others how to do things for them. You couldn't expect to reach excellence without having ways to foster it; otherwise what you could do is hire the offshoots of places that knew how to do it, and follow behind them like seagulls trailing a fishing boat.

On-the-job training brings up images of taking people unskilled in your domain off the street, giving them equipment, and making them good. It feels onerous, long, and ineffective. Never mind that training or teaching well is a skill that isn't always aligned with a person's main role. That being said, and without judging the possibility of doing it all-or-nothing like that, it brings up the possibility of picking requirements at an arbitrary height:

  • must know our process, stack, and methodology to an advanced level in order to join
  • must know our process, stack, and methodology to an acceptable level in order to join
  • must be familiar with our process and stack in order to join, and at least be proficient in one of our main technologies
  • must be familiar with the technologies we use
  • must be familiar with technologies resembling ours or have worked in similar stacks
  • doesn't need to know much about what we do

Those are often what companies aim to screen for when they hire, as made explicit by the lists of requirements in their job ads. They also pile on additional requirements around education, ambitions, "culture fit", experience, domain knowledge, ability to perform under pressure, and so on. My point is: when hiring someone, we set a base level we expect, and then either commit to leaving the new hire enough time to ramp up and close the gap or, more ideally, commit to training them until they reach or exceed expectations.

When we raise the bar of hiring without changing anything else, we tacitly externalize the cost of levelling up people to the ecosystem at large: universities, open source communities, bootcamps, competitors, and other organizations within the industry. We expect the experience and expertise to be gained elsewhere, hopefully saving us the costs in the process.

So here's the new lens: what would be required of organizations if we wanted to actually lower the bar of hiring, and we took it on ourselves to close the gap between what we expect of high performers and the situation they're in when we hire people? Could we quantify and qualify that?

Surfacing Required Knowledge

A scary aspect of this externalization is that—like a lot of things we externalize or commodify—our dependencies become invisible. Open source software is a bit the same: those who treat it like a free buffet, without regard for the sustainability of the libraries they use, can suddenly be surprised when the external actors (the maintainers) vanish or move on to other things. Two reactions feel safest:

  1. You use only things that are safely externalized, to avoid paying any further cost.
  2. You start participating in their ecosystems to make sure they remain sustainable.

To me, this is the rough edge to "use boring technology". Using boring technology sort of aligns you with safer externalization of software, but it reaches its limits when you start stretching common pieces of technology when your use cases are not aligned with their intended or most typical uses. Is it better to take a given database you know already and use it in weird ways only your team knows, or to add a new different storage mechanism that is used the way everybody else uses it and for which you will have lots of resources? At some point, the choice is not so obvious.

That choice is simple enough for open-source work, but it's not really an option for education and expertise. Some of the things your organization does are only going to be known by your organization. This includes domain knowledge, but also understanding your own software and your organization's history. These things turn out to be far more significant than expected, and if you are not aware of it or taking steps to manage it, you're left with a haphazard strategy, in a position made fragile by surprises.

If you're providing software as a service or operating a platform for your customers, one of the most revealing questions to ask is "how could we let customers run this on-premises?" Take your whole stack, and assume you're shipping it to someone whose workforce does not have your own team's experience and knowledge. They have to run it, operate it, apply updates, everything. How would you close the gap? For most places I've worked at, this was generally unthinkable: there's too many skeletons, too many gotchas, bits we're not proud of but are still there, complex interactions you can't manage well without deep knowledge of internals, and so on.

A system is its rolled up history of adaptations to its experiences and expectations of stressors and pressures.
    - David Woods

You are dependent on a living experience embodied by your team, and it's near impossible to divorce your system's ongoing success from its people. The few places that can afford to say they can usually have built everything with the expectation that operators are going to be external, and the knowledge required to run things will need to be made explicit in order to let others run it all.

Ask similar questions about your staff: if a given employee were to leave tomorrow with no knowledge transfer, how much trouble would we be in? Can we simulate that by making them take surprise paid vacation for a week? What would come up? Is there glue work they do and we don't realize?

Are there parts of our system where we don't really know what good results are supposed to look like, and we're mostly relying on things keeping on keeping on the way they are? Who knew what "good behaviour" for the system was supposed to be when we added these features, and what happened since then? What are some good war stories in your group, and how are they being passed on?

These questions are sure to identify conceptual gaps where we have hard dependencies on what our people know but are very likely not tracking it for real nor preparing to transfer that knowledge, what do we do? I can think of two families of approaches: one that is structuring and explicit, and one which is about fostering the right conditions for things to happen.

An explicit structuring approach would have you map out all the things we need, or at least the important ones. Draw out lists of requirements, find the odd ones out you never really use or no longer have. What did the last person we had who knew how to run the legacy stack know? What are the weird things we do to keep things running? Some of these questions can be asked directly. These will likely have to do with the things we know we need when things go well: how we write code, test it, build it, deploy it, look to know things are fine. Which dashboards or queries or logs people look at. They're generally the things you might put in your on-boarding documents, but they rot fast.

Your tell-tale signs of that would be having internal documentation (that is up to date), internal bootcamps, a library of presentations and tutorials to help people level up, the equivalent of game days and simulator hours in airline pilots where you can train and get acquainted with all the complexity as part of you joining in. This has a cost, and the weight of this structure creates a rigidity that can limit your own adaptiveness; you need to be able to undo and adjust all of that material constantly to keep it useful, rather than treat it like precious ruins no one should disturb.

For many of the explicit structured approaches, you won't get the important stuff by just asking. Most of it is tacit knowledge that only shows up when things break. And then you will see your local experts identify the failure mode, reason about how it happened, and find ways to correct course for the system and make things work again. That is the sleeping expert knowledge that gets invoked in times of need but that we otherwise never recall explicitly. It requires time and lots of observations to catalog that stuff.

Fostering Expertise

I've mentioned earlier that I'd be loose with terminology. Here it makes sense to force a distinction between knowledge and expertise. Knowledge can be the things you know such as facts and strategies, context and history around decisions. They're somewhat easy to transmit between people. I'll define expertise as trickier, and relying on experience to easily and correctly make use of knowledge on a contextual basis. It's the difference between knowing the rules, and when to break them. It's figuring out what information not being there could mean, and how to adjust.

I believe that making information with which you build knowledge explicit can work for foundational stuff, but it's not going to be workable everywhere, and you will hit diminishing returns. Instead (or on top of some structure), there are habits you can take that will make knowledge transfer and skill improvement part of your culture in a way that is compatible with gaining experience faster as well.

Your people's expertise will show up when encountering novel situations that will challenge them. When shit hits the fan, you will find them finding ways to mitigate situations, buy time, re-evaluate the problem space, create and disprove tons of hypotheses, be surprised, and collaborate on all of it until a solution can be found. Those are going to be defining moments where teammates help each other level up.

Much like a capacity to adapt or good cardio, expertise is not something you have as much as something you do. Make sure everyone gets to exercise and be in contact with it. Walk the problem space and get people to talk to each other in ways that synchronize their mental models and experience. Provide ways for people to get quick feedback for their actions, both from mentors and from observing the results of their actions.

This can come from activities such as in-depth incident investigations (not "action item factories"), chaos engineering, code reviews that aim to spread ownership, apprenticeships, lunch and learns, "war stories", or other presentations for direct dissemination, and contrasting approaches that can bring broader perspectives. Good examples of the latter can be book clubs with discussions from people in similar roles but in other teams, or getting your customer support people on dev teams and engineering staff on customer support rotations.

These are concepts likely tied to symmathesy, about which Jessica Kerr has written great articles.

I suspect that both the structured knowledge and expertise fostering approaches work best in a pair when they can feed off of each other. Having them can show that the organization values learning and teaching, and increase internal mobility and adaptiveness. Use both, and turn them into a self-reinforcing loop.

Sustainable Externalization

Some stuff we won't be able to afford internalizing. There's a lot of complexity in our stacks, building our own corporate universities sounds less than ideal. In fact, the software industry is often seen as rather unique in how its workers act as a community, and I would hate losing that. Hillel Wayne's Crossover Project mentions:

We software engineers take the existence of nonacademic, noncorporate conferences for granted. But we’re unique here. In most fields of engineering, there are only two kinds of conferences: academic conferences and vendor trade shows. There’s nothing like Deconstruct or Pycon or !!con, a practitioner-oriented conference run solely for the joy of the craft.

What we do as a workforce is cope for the bad patterns of so many workplaces, and turn it into a strength. I'm someone without a formal computer science education that directly benefited from that openness, so I see it as rather critical. It also lets people go past the limits of what their local education system can offer, which might either be fantastic or terrible, and can act as a democratizing force.

However, implicit limits that come from "training on your own time" perpetuate structural privilege. The only reason I could do as much as I did without a formal education was that I was in a life situation that made it possible for me to learn a different language, work full time, study part time, write a technical book, and travel for work all at once. I've been fortunate in being in a context that isn't available to everyone: education, costs of living, dependants (children or less autonomous relatives), mental and physical health factors, work schedules and proximity, and all other sorts of factors where discrimination can come up can all be structural blockers that most people having to learn in their free time won't have to face on equal footing.

If we are to externalize knowledge and expertise to a broader ecosystem, I would advocate doing it sustainably: participate in the ecosystem, and make it so your people can do so during work hours. If you see a huge benefit from using the training provided by others (and their open-source code, while we're at it), find ways to take some of these savings and reinvest them to the community, even if it's just a tiny fraction of what you saved. Big industry players have already realized that their devrel and marketing pipelines can benefit from it, and smaller ones with a keen eye already are involved in local user groups.

Get your shoulder to the wheel and turn from consumer to participant, both for the sake of your employees, but also for the sake of your own sustainability.


A few notes on message passing

Message passing has always been central to Erlang, and while reasonably well-documented we’ve avoided going into too much detail to give us more freedom when implementing it. There’s nothing preventing us from describing it in a blog post though, so let’s have a closer look!

Erlang processes communicate with each other by sending each other signals (not to be confused with Unix signals). There are many different kinds and messages are just the most common. Practically everything involving more than one process uses signals internally: for example, the link/1 function is implemented by having the involved processes talk back and forth until they’ve agreed on a link.

This helps us avoid a great deal of locks and would make an interesting blog post on its own, but for now we only need to keep two things in mind: all signals (including messages) are continuously received and handled behind the scenes, and they have a defined order:

Signals between two processes are guaranteed to arrive in the order they were sent. In other words, if process A sends signal 1 and then 2 to process B, signal 1 is guaranteed to arrive before signal 2.

Why is this important? Consider the request-response idiom:

%% Send a monitor signal to `Pid`, requesting a 'DOWN' message
%% when `Pid` dies.
Mref = monitor(process, Pid),
%% Send a message signal to `Pid` with our `Request`
Pid ! {self(), Mref, Request},
    {Mref, Response} ->
        %% Send a demonitor signal to `Pid`, and remove the
        %% corresponding 'DOWN' message that might have
        %% arrived in the meantime.
        erlang:demonitor(Mref, [flush]),
        {ok, Response};
    {'DOWN', Mref, _, _, Reason} ->
        {error, Reason}

Since dead processes cannot send messages we know that the response must come before any eventual 'DOWN' message, but without a guaranteed order the 'DOWN' message could arrive before the response and we’d have no idea whether a response was coming or not, which would be very annoying to deal with.

Having a defined order saves us quite a bit of hassle and doesn’t come at much of a cost, but the guarantees stop there. If more than one process sends signals to a common process, they can arrive in any order even when you “know” that one of the signals was sent first. For example, this sequence of events is legal and entirely possible:

  1. A sends signal 1 to B
  2. A sends signal 2 to C
  3. C, in response to signal 2, sends signal 3 to B
  4. B receives signal 3
  5. B receives signal 1

Luckily, global orders are rarely needed and are easy to impose yourself (outside distributed cases): just let all involved parties synchronize with a common process.

Sending messages

Sending a message is straightforward: we try to find the process associated with the process identifier, and if one exists we insert the message into its signal queue.

Messages are always copied before being inserted into the queue. As wasteful as this may sound it greatly reduces garbage collection (GC) latency as the GC never has to look beyond a single process. Non-copying implementations have been tried in the past, but they turned out to be a bad fit as low latency is more important than sheer throughput for the kind of soft-realtime systems that Erlang is designed to build.

By default, messages are copied directly into the receiving process’ heap but when this isn’t possible (or desired – see the message_queue_data flag) we allocate the message outside of the heap instead.

Memory allocation makes such “off-heap” messages slightly more expensive but they’re very neat for processes that receive a ton of messages. We don’t need to interact with the receiver when copying the message – only when adding it to the queue – and since the only way a process can see a message is by matching them in a receive expression, the GC doesn’t need to consider unmatched messages which further reduces latency.

Sending messages to processes on other Erlang nodes works in the same way, albeit there’s now a risk of messages being lost in transit. Messages are guaranteed to be delivered as long as the distribution link between the nodes is active, but it gets tricky when the link goes down.

Using monitor/2 on the remote process (or node) will tell you when this happens, acting as if the process died (with reason noconnection), but that doesn’t always help: the link could have died after the message was received and handled on the other end, all we know is that the link went down before we got any eventual response.

As with everything else there’s no free lunch, and you need to decide how your applications should handle these scenarios.

Receiving messages

One might guess that processes receive messages through receive expressions, but receive is a bit of a misnomer. As with all other signals the process continuously handles them in the background, moving received messages from the signal queue to the message queue.

receive searches for matching messages in the message queue (in the order they arrived), or waits for new messages if none were found. Searching through the message queue rather than the signal queue means it doesn’t have to worry about processes that send messages, which greatly increases performance.

This ability to “selectively receive” specific messages is very convenient: we’re not always in a context where we can decide what to do with a message and having to manually lug around all unhandled messages is certainly annoying.

Unfortunately, sweeping the search under the rug doesn’t make it go away:

    {reply, Result} ->
        {ok, Result}

The above expression finishes instantly if the next message in the queue matches {reply, Result}, but if there’s no matching message it has to walk through them all before giving up. This is expensive when there are a lot of messages queued up which is common for server-like processes, and since receive expressions can match on just about anything there’s little that can be done to optimize the search itself.

The only optimization we do at the moment is to mark a starting point for the search when we know that a message couldn’t exist prior to a certain point. Let’s revisit the request-response idiom:

Mref = monitor(process, Pid),
Pid ! {self(), Mref, Request},
    {Mref, Response} ->
        erlang:demonitor(Mref, [flush]),
        {ok, Response};
    {'DOWN', Mref, _, _, Reason} ->
        {error, Reason}

Since the reference created by monitor/2 is globally unique and cannot exist before said call, and the receive only matches messages that contain said reference, we don’t need to look at any of the messages received before then.

This makes the idiom efficient even on processes that have absurdly long message queues, but unfortunately it isn’t something we can do in the general case. While you as a programmer can be sure that a certain response must come after its request even without a reference, for example by using your own sequence numbers, the compiler can’t read your intent and has to assume that you want any message that matches.

Figuring out whether the above optimization has kicked in is rather annoying at the moment. It requires inspecting BEAM assembly and even then you’re not guaranteed that it will work due to some annoying limitations:

  1. We only support one message position at a time: a function that creates a reference, calls another function that uses this optimization, and then returns to receive with the first reference, will end up searching through the entire message queue.
  2. It only works within a single function clause: both reference creation and receive need to be next to each other and you can’t have multiple functions calling a common receive helper.

We’ve addressed these shortcomings in the upcoming OTP 24 release, and have added a compiler option to help you spot where it’s applied:

$ erlc +recv_opt_info example.erl



t(Pid, Request) ->
    %% example.erl:5: OPTIMIZED: reference used to mark a 
    %%                           message queue position
    Mref = monitor(process, Pid),
    Pid ! {self(), Mref, Request},
    %% example.erl:7: INFO: passing reference created by
    %%                      monitor/2 at example.erl:5

await_result(Mref) ->
    %% example.erl:10: OPTIMIZED: all clauses match reference
    %%                            in function parameter 1
        {Mref, Response} ->
            erlang:demonitor(Mref, [flush]),
            {ok, Response};
        {'DOWN', Mref, _, _, Reason} ->
            {error, Reason}


Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.