Example-based Tests And Property-based Tests Are Good Friends

I mostly use property-based testing to test stateless functional code. A technique I love to use is to pair property-based tests together with example-based tests (that is, “normal” tests) in order to have some tests that check real input. Let’s dive deeper into this technique, some contrived blog-post-adequate examples, and links to real-world examples.
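
To make the pairing concrete, here is a minimal Elixir sketch of my own (using ExUnit with the ExUnitProperties/StreamData library and a stand-in Enum.sort as the code under test; none of this comes from the original post): the example-based test pins down one concrete, hand-checked input and output, while the property checks a general invariant over many generated inputs.

defmodule SortTest do
  use ExUnit.Case
  use ExUnitProperties

  # Example-based test: one real input with a hand-checked expected output.
  test "sorts a known list" do
    assert Enum.sort([3, 1, 2]) == [1, 2, 3]
  end

  # Property-based test: invariants checked against many generated lists.
  property "sorting preserves length and is idempotent" do
    check all list <- list_of(integer()) do
      sorted = Enum.sort(list)
      assert length(sorted) == length(list)
      assert Enum.sort(sorted) == sorted
    end
  end
end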


Let's Draw a Line!

Now that we’ve got the tools we need to create and save an image, and a base framework to build additional functionality on top of, let’s turn to the next important part of a graphics library: being able to draw something on to the image.


Let’s start very simple, with the humble straight line.

cairo provides two functions for drawing a straight line:

  • cairo_line_to
  • cairo_rel_line_to

Each takes two arguments, floats representing the x and y coordinates of the destination point. The difference lies in how they interpret those coordinates, as the names of the functions suggest. cairo_line_to draws a line to the given coordinates interpreted in absolute space, while cairo_rel_line_to interprets the given coordinates as deltas applied relative to the origin point.

Speaking of origin points, you’ll have noticed that each of these functions takes only one set of coordinates, the end point of the line. So how do we know where to start drawing?

As part of its state, the in-memory Cairo context keeps a record of its “current point”. This point starts out at (0, 0), and is updated every time a line (or other shape) is added to the current path, or by using two related functions: cairo_move_to and cairo_rel_move_to. As you might imagine, these act much the same way as the _line_to functions above, but simply update the “current point” without adding a line to the path.
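
To make the current point bookkeeping concrete, here is a rough sketch of my own in terms of the Xairo functions we will build over the rest of this post (the coordinates are arbitrary):

image
|> Xairo.move_to(Point.new(10, 10))  # moves the current point to (10, 10); adds nothing to the path
|> Xairo.line_to(Point.new(50, 10))  # adds a segment from (10, 10) to (50, 10); current point is now (50, 10)
|> Xairo.line_to(Point.new(50, 40))  # the path continues from (50, 10) down to (50, 40)
|> Xairo.stroke()                    # renders the accumulated path onto the surface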

Two other functions it will be useful to know about before diving into implementing this functionality in Xairo are cairo_stroke and cairo_fill. These are used to complete a path built up from _line_to, _move_to and other related functions, and then render it to the image surface, either by stroking the path as a line or by attempting to fill it in, respectively. These functions are related to paint, which we already saw implemented in a previous post. For now, though, since we’re only drawing straight lines, we’ll only worry about _stroke.

So let’s look at the functions we need to implement. For now, let’s leave aside the _rel_*_to functions, since their implementation will be very close to the absolute coordinate functions, and also _fill, since it will be a very small and predictable variation on _stroke. So that leaves us with:

  • line_to
  • move_to
  • stroke

On the Elixir side

In C and Rust, the line_to and move_to functions take 2 arguments, the x and y coordinates as floats. This is fine, and we could implement Xairo.line_to and Xairo.move_to to take 2 arguments (in addition to the image resource) without any problem. But in order to make Xairo as user-friendly as possible, wouldn’t it be nice to provide, through function signatures, some context for those two floats? What if we wrap them in a very simple Point struct?

defmodule Xairo.Point do
  defstruct [:x, :y]

  def new(x, y) do
    %__MODULE__{
      x: x * 1.0,
      y: y * 1.0
    }
  end
end
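
A quick (hypothetical) IEx check shows what the * 1.0 in Point.new buys us: integer arguments are coerced into the floats that cairo expects.

iex> Xairo.Point.new(20, 35)
%Xairo.Point{x: 20.0, y: 35.0}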

then we can write Xairo.line_to and Xairo.move_to, which take a single, contextualized Point rather than two unmarked float values.

Now we have a choice. We can either have our NIF for line_to accept x and y, and unpack the Point in Xairo.line_to, or we could keep that context through the NIF into Rust, and unpack the coordinates only when it’s time to pass them into cairo. I prefer the second option, maintaining struct contexts as much as possible through the API code, both in Elixir and Rust, and unwrapping them only when it’s necessary to communicate with the original C API.

This means that for Xairo.line_to the code we write in Elixir, along with the Point struct definition from above, looks like

defmodule Xairo.Native do
  ...
  def line_to(_image, _point), do: :erlang.nif_error(:nif_not_loaded)
end

defmodule Xairo do
  ...
  def line_to(%Image{} = image, %Point{} = point) do
    Xairo.Native.line_to(image.reference, point)
    image
  end
end

How cool is that? Barely any code at all. As we’ll see next, things are pretty much the same over in Rust.

On the Rust side

In the Elixir code above, we saw that instead of passing two floats into Rust as the coordinates, we’re passing an Elixir struct that we’ve created. As I alluded to in an earlier post,1 rustler actually makes passing structured data between Elixir and Rust very simple by providing a NifStruct trait. So we define our Rust struct

#[derive(Copy,Clone,Debug,NifStruct)]
#[module = "Xairo.Point"]
pub struct Point {
    pub x: f64,
    pub y: f64
}

We derive the NifStruct trait and add an annotation matching it to the struct’s name in Elixir, and that is all we have to do to be able to use the struct directly in our NIF

#[rustler::nif]
fn line_to(context: WrapperRA, point: Point) -> WrapperRA {
    context.context.line_to(point.x, point.y);
    context
}

We implement move_to and stroke in the same way, and then we should be able to draw some lines on our image, right?

Well, just about. Except for paint, none of our functions deal with color. So how do we tell cairo what color to use when it draws a line?

Our NIF for paint cheated just a little bit by wrapping two cairo functions together. It calls set_source_rgb and paint. This was fine a little while ago because we only had one function that could render color onto the canvas. But now we’re adding more, so let’s surface set_source_rgb out into its own NIF.2 While we’re at it, let’s provide an RGB struct context for those values like we did with Point. The implementation of this struct in Elixir and Rust will be left as an exercise for the reader.
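
For reference, here is one possible Elixir answer to that exercise, mirroring the Point struct above; the field names follow how the Rust NIF below reads them (rgb.red, rgb.green, rgb.blue), and the Rust side would mirror the Point struct with its own NifStruct derive. This is my own sketch, not necessarily what the library itself does.

defmodule Xairo.RGB do
  defstruct [:red, :green, :blue]

  # cairo clamps each colour component to the 0.0..1.0 range, so values like
  # 255 passed to set_source_rgb still render at full intensity.
  def new(red, green, blue) do
    %__MODULE__{
      red: red * 1.0,
      green: green * 1.0,
      blue: blue * 1.0
    }
  end
end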

At the end of this, our Xairo Elixir module looks like

defmodule Xairo do
  ...
  def line_to(%Image{} = image, %Point{} = point) do
    Xairo.Native.line_to(image.reference, point)
    image
  end

  def move_to(%Image{} = image, %Point{} = point) do
    Xairo.Native.move_to(image.reference, point)
    image
  end

  def set_source_rgb(%Image{} = image, %RGB{} = rgb) do
    Xairo.Native.set_source_rgb(image.reference, rgb)
    image
  end

  def paint(%Image{} = image) do
    Xairo.Native.paint(image.reference)
    image
  end

  def stroke(%Image{} = image) do
    Xairo.Native.stroke(image.reference)
    image
  end
end

and Rust like

#[rustler::nif]
fn line_to(context: WrapperRA, point: Point) -> WrapperRA {
    context.context.line_to(point.x, point.y);
    context
}

#[rustler::nif]
fn move_to(context: WrapperRA, point: Point) -> WrapperRA {
    context.context.move_to(point.x, point.y);
    context
}

#[rustler::nif]
fn set_source_rgb(context: WrapperRA, rgb: RGB) -> WrapperRA {
    context.context.set_source_rgb(rgb.red, rgb.green, rgb.blue);
    context
}

#[rustler::nif]
fn paint(context: WrapperRA) -> WrapperRA {
    context.context.paint().unwrap();
    context
}

#[rustler::nif]
fn stroke(context: WrapperRA) -> WrapperRA {
    context.context.stroke().unwrap();
    context
}

A nice 1-to-1 mapping of the API, with our structured data kept contextualized until it needs to be destructured to pass to the native API.

Let’s end by putting it all together and drawing an empty white image with two horizontal black lines:

Xairo.Image.new(100, 100)
|> Xairo.set_source_rgb(RGB.new(255, 255, 255))
|> Xairo.paint()
|> Xairo.set_source_rgb(RGB.new(0, 0, 0))
|> Xairo.move_to(Point.new(20, 20))
|> Xairo.line_to(Point.new(80, 20))
|> Xairo.move_to(Point.new(20, 80))
|> Xairo.line_to(Point.new(80, 80))
|> Xairo.stroke()
|> Xairo.save_image("lines.png")

and that gives us

[lines.png: a 100x100 white image with two horizontal black lines]

just as we hoped

Next time we’ll look at some of that duplication creeping into Xairo, and talk about next steps for the library. Thanks for reading!


Footnotes

  1. In that previous post, I decided against using this option because of the complexity of the data we wanted to pass in the struct. Here, where we’re only passing two floating point numbers, it is much simpler. (For the Rust-ier among you, this approach is possible as long as the field types all implement the Copy trait.) 

  2. We could implement stroke (and later, fill) similarly, where they take the color and set it before rendering. Because of how cairo works under the hood, you can set the color at any point during a path’s creation/extension as long as it’s before you call stroke. But, from a workflow perspective, it is more natural when drawing to select the color first, so being able to separate these functions lends itself to a more familiar workflow when using Xairo.


Elixirizing our Rust

If you recall from the last post, we succeeded in our goal of generating some art. Or something resembling art, at least. A stepping stone on the way to art, let’s say. We’re not going to get any more artful in this post, but instead we’ll be taking time to clean up what we’ve accomplished so far to provide a solid base to build a full library off of.


So, where did we leave off?

The library as it stands

Right now we’ve got the Xairo.Native module that provides the function signatures we can map to our Rust NIFs:

defmodule Xairo.Native do
  use Rustler, otp_app: :xairo, crate: "xairo"

  def new(_, _), do: :erlang.nif_error(:nif_not_loaded)
  def paint(_, _, _, _), do: :erlang.nif_error(:nif_not_loaded)
  def save(_, _), do: :erlang.nif_error(:nif_not_loaded)
end

Except for new/2, each of these functions takes as its first argument an Elixir Reference that maps to an in-memory Rust struct that holds the cairo ImageSurface and Context objects:

pub struct CairoWrapper {
  pub context: Context,
  pub surface: ImageSurface
}

Functions in Rust accept the reference and return it, allowing us, as we saw in the previous post, to chain functions together with the |> operator in a more Elixir-y way. But the fact remains that we’re passing around a bare Reference, which is, to put it mildly, useless when it comes to understanding what that reference represents.

So let’s create a struct that gives us some basic information about the image and also stores the image’s reference.

Xairo.Image to the rescue

defmodule Xairo.Image do
  defstruct [:width, :height, :reference]

  def new(width, height) do
    reference = Xairo.Native.new(width, height)
    %__MODULE__{
      width: width,
      height: height,
      reference: reference
    }
  end
end

Ok, great, we’ve got a struct that tells us, for now, the width and height of our image, and provides the reference to the in-memory image so that we can pass it back to Rust. But our NIFs expect a Reference to be given, and now we’ve got this Xairo.Image struct. Fortunately, the Xairo.Image struct has a reference field, so we can pass that to our Xairo.Native functions.1

To do this, we’ll need functions that take a Xairo.Image and arguments, extract the reference from the image, pass it and the other arguments to Xairo.Native, and return the full Xairo.Image struct so that we can continue using the |> operator. For a public-facing API, the root module Xairo seems like a good place for these functions to live.

defmodule Xairo do
  def paint(%Xairo.Image{reference: reference} = image, red, green, blue) do
    Xairo.Native.paint(reference, red, green, blue)
    image
  end
end

For the sake of brevity we’ll exclude the other functions, but this example should be enough to show the path we’re starting down.

If we hop over into an IEx console, we can see that our previous example

iex(1)> Xairo.Native.new(100, 100) \
...(1)> |> Xairo.Native.paint(0.5, 0.0, 1.0) \
...(1)> |> Xairo.Native.save("test.png")
#Reference<0.3888755325.2009989122.47088>

can now be written

iex(1)> Xairo.Image.new(100, 100) \
...(1)> |> Xairo.paint(0.5, 0.0, 1.0) \
...(1)> |> Xairo.save("test.png")
%Xairo.Image{ ... }

It’s not much to look at, but under the hood we’ve built the following framework with clear demarcations that gives us a base to expand on:

Xairo / Xairo.Image <-> Xairo.Native <-> Rust

The user calls functions from the Xairo or Xairo.Image modules, which are delegated to the Xairo.Native module, which provides the bridge to the Rust code. From there, the reference is passed back up the chain and finally returned as part of a Xairo.Image struct.
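
To make the round trip concrete, here is a sketch of how save/2, which was omitted above for brevity, would follow the same delegation pattern as paint/4 (my own reconstruction, not code from the post):

defmodule Xairo do
  def save(%Xairo.Image{reference: reference} = image, filename) do
    # delegate to the NIF with the bare reference, then hand back the full struct
    Xairo.Native.save(reference, filename)
    image
  end
end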

In the next posts we’ll start looking at expanding our function palette to start drawing shapes on the image, as well as how to avoid some of the code duplication that’s going to start showing up in the Xairo API functions.

Thanks for reading!


Footnotes

  1. In theory we could create a Rust struct that maps to Xairo.Image and pass the struct itself back and forth, but this quickly becomes unwieldy when trying to sort out how to ensure that the native cairo-rs structs can be encoded/decoded safely. It’s not impossible, but it is the decidedly harder of the two options. We’ll see this put into practice a bit later on with some much simpler structs. 


erlang-questions: A tale of push VS pull and authority VS lolberty

I just plowed through my email backlog for the first time in a long time to find that the venerable and seemingly eternal resource for n00bs and webtards in need of sage graybeard advice known as the “erlang-questions” mailing list is now deprecated by the Erlang Something-or-other (Foundation? whatever).

I suppose this means they intend to nuke the archive of messages as well, which effectively obliterates the thing Joe Armstrong liked about it the most: the most useful posts and threads tended to turn up in web searches and were easily linked as resources years after the discussions themselves happened.

This came to my attention by way of a monster thread that started out innocently enough as “PING TEST”:

My apologies for spamming, but the list was unusually quiet for the last couple of days, and I just wanted to test if it is still active.
Kind regards

Valentin Micic (Fri Dec 3 08:08:30 CET 2021)

After a few stylish replies of a form common on periodically low traffic lists that simply indicated that the list was still working, a somewhat unexpected reply was then encountered:

Nowadays you will find more activity at https://erlangforums.com

Björn Gustavsson (Fri Dec 3 09:05:31 CET 2021)

This prompted an unsurprising and appropriate reaction from none other than Richard O’Keefe (aka “ROK”, aka “The ROK”, aka “The guy who knows way more than you do so if he doesn’t have time to explain in detail just do whatever he says and you’ll be much closer to Doing It Right”):

As far as I am concerned, anything that diverts traffic away from the Erlang mailing list to a labyrinth like erlangforums.com is a Bad Thing. Sigh.

Richard O’Keefe (Fri Dec 3 09:26:48 CET 2021)

This was followed up by quite a slew of posts from long-time Erlangers noting basic problems with the way the forums are framed by the Foundation (or whatever it is), trouble with the legalese involved, the technical problems involved in future mining of old wisdom from the forums if any ever turns up there, and subtly and not-so-subtly pointing out that this fits a disturbing trend of the webtards and wokescolds of the intertubes slowly breaking up the community over time and driving the graybeards away. Check the archive thread out — it is quite interesting to “next post” your way through it.

By the end it was clear that feelings among the long-time Erlangers ranged from extreme allergy to more neutral mild annoyance, as one often sees when The Committee is clearly doing something absolutely fucking stupid and Those Who Legitimately Know Better are trying to tell them so without starting a war over it. Fitting with the trend, the only two long-time Erlangers who actually support the move to trash-tier web forums are the literal card carrying communist and the kommunity karen (the guy who called me a racist and a Nazi on Slack because I said that I believe national borders are a thing that governments have a duty to their citizens to enforce — clearly the heart and soul of the National Socialist platform and not at all an indication that the angry little man in question had just proved Godwin’s Law right yet again).

Having been away from community interaction for quite a while and figuring this would probably be my last post before they get rid of it entirely, I decided to offer my $2.50 (no, the inflation is not transitory and if you believed it was you have cognitive trouble of the same sort Grandpa Badfinger suffers) only to be met with this beauty of a response from the mailing list daemon:

Irony is part of Nature’s sense of poetic humor

Let me just say HA HA HA!

In the spirit of not letting anything go into the memoryhole as long as my own site remains in operation, here is what I wrote:

On 2021/12/19 9:02, Jeff Schultz wrote:
> On 18/12/2021 04:02, Igor Clark wrote:
>> I agree wholeheartedly. Thank you for putting it so carefully and insightfully, Yao.
>
> I also agree.
>
> On an internet mailinglist, I can listen.
> On an internet forum, I am property.

At least you grok the motivation behind this sort of move.

A central point I have to drive into my geopol and intel students over and over is that "Capacity drives intent". Even if the original intent was to make a more "modern" single place to discuss Erlang, once that place becomes The One True Community the exclusionary capacity enabled by it will eventually change the intent of those running it. (In the end it will only make the Erlang community lore harder to find in the noise -- but whatever, no online platform is forever unless you do it yourself.)

I've just had to make my own little island of misfits -- mostly new Erlangers who want to solve problems and really don't care about anything else, least of all feels politics and other regulatory nonsense that people want to inject into projects and communities by rudely shoving their CoC into it. I don't really see a way around going independent to avoid the worsening social cancer.

I'll be on the ML as long as it is around and might occasionally check the forums if I'm super bored, but I'm pretty much seeing this as the end of the graybeard era -- it will definitely be a case of "nobody realized what they had until it was gone".

I have absolutely benefited an unfair amount from having this mailing list as a resource for the last several years. Thanks to all for the lessons and the laughs. Being proven wrong on the ML has always been one of those magical experiences where I could directly map blows to my ego to immediately useful lessons learned -- really fantastic stuff.

A late Merry Christmas to you all! Weee!
-Craig 

Anyway, all golden ages end eventually — our task as survivors is to simply push on ahead and try to manifest another golden age as soon as possible, and if we are both diligent and lucky we may just generate a small cascade of renaissances amongst one another. Let’s hope for that future instead of being bitter.

On that note, if anyone is interested in my little island of misfits just send me an email and I’ll point the way. Saving that, I’m keeping the candle on at the #erlang IRC channels on OFTC and EFNet as well as the StackOverflow Erlang/OTP chat (I’m sure there is still an #erlang on freenode and libera, but until that ball of retardation is a distant memory and the landscape is clear I’m spending my effort elsewhere).

[Protip for 2022: Spend time with your family, friends and loved ones. Mend some fences or establish relationships with people you can physically see in real life (like really real life, not online). Never know when that’s going to come in handy.]


Plato's Dashboards

2021/12/22


In the allegory, Socrates describes a group of people who have lived chained to the wall of a cave all their lives, facing a blank wall. The people watch graphs projected on the wall from metrics passing in front of a dumpster fire behind them and create dashboards for these metrics. The metrics are the prisoners' reality, but are not accurate representations of the real world.

Socrates explains how the engineer forced to use their own product is like a prisoner who is freed from the cave and comes to understand that the dashboards on the wall are actually not reality at all...

There's going to be nothing new under the sun if I just state that the map is not the territory and that metrics shouldn't end up replacing the thing they aim to measure (cf. Goodhart's Law), and while most people would agree with these ideas in principle, we all tend to behave very differently in practice.

Mostly, the question I ask myself is how do we make sure metrics are properly used to direct attention and orient reactions, rather than taking them for their own reality at the cost of what's both real and important.

So in this post I'm going to go over what makes a good metric, why data aggregation on its own loses the resolution and messy details that are often critical to improvements, and how good uses of metrics show up in their ability to assist changes and adjustments.

What's the Metric For

Earlier this year I was reading Dr. Karen Raymer's thesis, titled "I want to treat the patient, not the alarm": User image mismatch in Anesthesia alarm design, and while I enjoyed the whole read, one of the things that struck me the most was this very first table in the document:


From "I want to treat the patient, not the alarm": User image mismatch in Anesthesia alarm design.

The classification scheme used here is possibly the clearest one I've had the chance to see, one I wish I had seen in use at every place I've worked. Here are the broad categories:

  • Surrogate variable that is measured: the thing we measure because it is measurable
  • Variable of true or greater interest: the thing we actually want to know about
  • Measurement technique of surrogate variable: whether we have the ability to get the actual direct value, or whether it is rather inferred from other observations
  • Artefactual influences: what are the things that can mess up the data in measuring it
  • Certainty of "Normal Range": how sure are we that the value we read is representative of what we care about?

I don't recall ever seeing an on-call alert have this amount of care in defining what it stands for, including those I defined. And starting to ask these questions quickly creates a very strong contrast with the seriousness we give to some metrics. Do you even have clearly defined variables of true or greater interest?

Particularly in bigger organizations with lots of services where a standard has to be adopted, you often end up in a situation where you track things like "95th percentile of response time is under 100ms for 99% of minutes for each reporting period" or "99.95% of responses return a status code below 500" (meaning "successful or not broken by our fault"). So sure, those are the surrogate variables. What's the true one? User satisfaction? How "snappy" the product feels? Are the metrics chosen truly representative of that experience? What are the things that can pollute or mess with this measurement? Have we even asked these questions, or did we just go with a value some FAANG player published 10 years ago?

Let's pick user satisfaction as a variable of greater interest. What easily measurable proxy variables do we have? User satisfaction surveys? The number of support tickets opened? Reviews on external sites? If any of these are moderately acceptable, how often do they move in conjunction with your latency and error rate? If multiple metrics were adequate surrogates for user satisfaction, you'd expect most of them to react together in cases of disruptions from time to time, otherwise where is the causal link?

And really, to me, that's sort of the core point. Can you describe what the metric stands for in enough detail to know when it's irrelevant and you're free to disregard it, or when it's important stuff you actually need to worry about? Do we have enough clues about the context we're in to know when they're normal and abnormal regardless of pre-defined thresholds? If we don't, then we're giving up agency and just following the metric. We're driving our vehicle by looking at the speed dial, and ignoring the road.

Incidents and useless targets

It's one thing to deal with performance indicators at the service level like response times or status codes. These tend to retain a semantic sense because they measure something discrete, regardless of what they're a surrogate variable for and how much distance they have from the variable of greater interest. The information they carry makes sense to the engineer and can be useful.

Things get funkier when we have to deal with far more vague variables with less obvious causal relationships, and a stronger emotional component attached to them.

MTBF and MTTR are probably the best recent examples there, with good take-downs posted in the VOID Report and Incident Metrics in SRE. One of the interesting aspects of these values is that they especially made sense in the context of mechanical failures for specific components, due to wear and tear, with standard replacement procedures. Outside of that context, they lose all predictive ability because the types of failures and faults are much more varied, for causes often not related to mechanical wear and tear, and for which no standard replacement procedures apply.

It is, in short, a rather bullshit metric. It is popular, a sort of Zombie idea that refuses to die, and one that is easy to measure rather than meaningful.

Incident response and post-incident investigations tend to invite a lot of similar kneejerk reactions. Can we track any sort of progress to make sure we aren't shamed for our incidents in the future? It must be measured! Maybe, but to me it feels like a lot of time is spent collecting boilerplate and easily measurable metrics rather than determining if they are meaningful. The more interesting question is whether the easy metrics are an effective approach to track and find things to improve compared to other ones. Put another way, I don't really care if your medicine is more effective than placebos when there already exists some effective stuff out there.

Deep Dives and Messy Details

So what's more effective? One of my favorite examples comes from the book Still Not Safe: Patient Safety and the Middle-Managing of American Medicine. They look at the performance of anesthesia as a discipline, which has had great successes over the last few decades (anesthesia-mortality risk has declined tenfold since the 1970s), and compare it to the relative lack of success in improving patient safety in the rest of American (and UK) healthcare despite major projects trying to improve things there.

They specifically state:

[A]nesthesia was not distracted into a fruitless and sterile campaign to stamp out “errors,” as would occur in the broader patient safety movement. This is likely due to the substantive influence of nonclinical safety scientists whose field was beginning to develop new thinking about human performance. This new thinking held that “errors” were not causes but rather were consequences; that they did not occur at random but were intimately connected to features of the tools, tasks, and work environment. Thus they were symptoms of deeper problems requiring investigation, not evils to be eliminated by exhortation, accountability, punishment, procedures, and technology.
[...]
Anesthesia’s successful method was largely intensive—detailed, in-depth analysis of single cases chosen for their learning potential (often but not always critical incidents). The broader patient safety field used a health services research approach that was largely extensive—aggregation of large numbers of cases chosen for a common property (an “error” or a bad outcome) and averaging of results across the aggregate. In the extensive approach, the contributory details and compensatory actions that would be fundamentally important to safety scientists tend to disappear, averaged out in the aggregate as “messy details.” [...] The extensive approach is typical of scientific-bureaucratic medicine—the thinking that nothing much can be learned from individual cases (which in medicine are profanely dismissed as mere anecdotes), that insight comes from studying the properties of the aggregate. This approach has its roots in public health and epidemiology, not clinical care; it is exemplified by the movements for clinical practice guidelines and “evidence-based medicine,” with their implicit valuing of the group over the individual good. Thus the two fields used fundamentally different scientific and philosophical approaches, but no one remarked on the differences because the assumptions underpinning them were taken for granted and not articulated.

The importance of this difference is underscored by the early history of safety efforts in anesthesia. The earliest work conducted in the 1950s (e.g., Beecher) used a traditional epidemiological approach, and got nowhere. (Other early efforts outside of anesthesia similarly foundered.) Progress came only after a fundamental and unremarked shift in the investigative approach, one focusing on the specific circumstances surrounding an accident—the “messy details” that the heavy siege guns of the epidemiological approach averaged out or bounded out. These “messy details,” rather than being treated as an irrelevant nuisance, became instead the focus of investigation for Cooper and colleagues and led to progress on safety.

The high-level metrics can tell you about trends and possibly disruptions—such as tracking the number of deaths per procedure. But the actual improvements that were effective were the results of having a better understanding of work itself, by focusing on the messy details of representative cases to generate insights.

This is where practices such as Qualitative research come into play, with approaches like Grounded Theory, which leave room for observation and exploration to produce better hypotheses. This approach may be less familiar to people who, like me, mostly learned about research in elementary and high school science classes and were left with a strawman of the scientific method focused on "picking one hypothesis to explain something and then trying to prove or disprove it". The dynamic between approaches is often discussed in terms of Qualitative vs. Quantitative research, with mixed methods often being used.

I have a gut feeling that software engineers often act like they have some inferiority complex with regards to other engineering disciplines (see Hillel Wayne's Crossover Project for a great take on this) and tend to adopt methods that feel more "grown up"—closer to "hard" sciences—without necessarily being more effective. When it comes to improving safety and the records of whole organization around incidents, the phenomena are highly contextual, and approaches centered on numerical data may prove to be less useful than those aiming to provide understanding even if they aren't as easily measurable in a quantifiable way.

What's the Reaction?

So let's circle back to metrics and ways to ensure we use them for guidance rather than obey them reflexively. Metrics are absolutely necessary to compress complex phenomena into an easily legible value that can guide decision-making. They are, however, lossy compression, meaning that without context, we can't properly interpret the data.

A trick Vanessa Huerta Granda gives is to present the metrics you were asked for, but contextualize them by carrying the story, themes, qualitative details, and a more holistic view of everything that was in place. This is probably the best advice you can find if you really can't get away with changing or dropping the poorer metrics.

But if you're in control? I like to think of this quote from Systemantics:

THE MEANING OF A COMMUNICATION IS THE BEHAVIOUR THAT RESULTS.

The value I get out of a metric is in communicating something that should result in a change or adjustment within the system. For example, Honeycomb customers were asking how long of a time window they should use for their SLOs. Seven days, fourteen days? Should the windows line up with customer events or any specific cycle? My stance simply is: whatever makes them the most effective for you to discuss and act on a burning SLO as a team.

At the very basic level, you get the alarm, you handle the incident, and you’re done. At more advanced levels, the SLO is usable as a prioritization tool for you and your team and your organization to discuss and shape the type of work you want to do. You can treat it as an error budget to be more or less careful, as a reminder to force chaos experiments if you’re not burning it enough—to keep current in your operational practices—or as an early signal for degradations that can take a longer time (weeks/months) to address and scale up.

But all these advanced use cases only come from the SLOs being successfully used to drive and feed discussions. If you feel that one week gives you a perfect way to discuss weekly planning whenever it rolls over, that works. If your group prefers a two-week duration to compare week-over-week and avoid papering over this week’s issues because they’re gonna be budgeted next week only, then that works as well. Experiment and see what gets the best adoption or gives your group the most effective reaction you can be looking for, besides the base alerting.

The actual value is not in the metric nor the alert, but in the reaction that follows. They're a great trigger point for more meaningful things to happen, and maintaining that meaningfulness should be the priority.


OTP 24.2 Release

OTP 24.2

Erlang/OTP 24.2 is the second maintenance patch release for OTP 24, with mostly bug fixes as well as a few improvements.

Below are some highlights of the release:

Highlights

  • crypto: The crypto app in OTP can now be compiled, linked and used with the new OpenSSL 3.0 cryptolib. It has not yet been extensively tested, so it is only recommended for experiments and alpha testing in this release. There are not yet any guarantees that it works, not even together with other OTP applications such as SSL and SSH, although there are no known errors.
  • erts: An option for enabling dirty scheduler specific allocator instances has been introduced. By default such allocator instances are disabled. For more information see the documentation of the +Mdai argument to the erl command.

  • erl_docgen, erts: All predefined types have been added to the erlang module together with documentation. Any reference to a predefined type now links to that documentation so that the user can view it.

  • erts: Responsiveness of processes executing on normal or low priority could suffer due to code purging or literal area removal on systems with a huge number of processes, since during these operations all processes on the system were scheduled for execution at once. The new solution is to limit the number of outstanding purge and copy requests to two times the number of schedulers by default.

For more details and downloads follow this link

The Erlang/OTP source can also be found at GitHub on the official Erlang repository, https://github.com/erlang/otp


Copyright © 2016, Planet Erlang. No rights reserved.
Planet Erlang is maintained by Proctor.