
Downstream effects

Why fixing data problems downstream is a losing battle. Lessons from DevOps and a flooded Auckland on swarming issues and solving them at source.

Lee Durbin, Data Analytics & Leadership Consultant

Swarm, solve at source, and share

Elton John was in town a few weeks ago, playing the second of two shows in New Zealand that he’d been forced to postpone a couple of years ago.

Things didn’t go according to plan.

The rain poured, and poured, and poured… and fifteen minutes before the show was scheduled to start everyone was told to go home. The show was cancelled, and Elton never made it on to the stage.

You could hardly blame him for backing out: the rainfall on that day was described as a 1-in-200-year event, and January 2023 would ultimately go down as the wettest month in Auckland’s history. Just a week or so later we were hit by a cyclone that caused massive damage in Auckland and beyond, and now they’re telling us that other cyclones are forming. The great Kiwi summer it is not.

My family and I weren’t totally unaffected by the events of 27 January (my wife spent over six hours driving around Auckland before she was finally able to make it home), but we remained safe and we didn’t suffer any property damage. Despite all that, we were at the very least inconvenienced when we woke up on the Saturday with no running water, and for the next couple of days that’s how it went: either no water at all, or very low pressure.

There’s something unsettling about not knowing if anything will come out of the tap when you turn it on, and so I found myself obsessively monitoring the Watercare website. The photo at the top of this newsletter shows you where a 30-metre-long section of water pipe in Titirangi was washed away with the road due to a landslide, which got a lot of attention. Except we don’t live in Titirangi (even though we live in a suburb close to it), and it wasn’t entirely clear why we weren’t getting any water.

The uncertainty was the thing that bothered me the most. We could see Watercare vans driving around the neighbourhood, desperately trying to figure out where the fault was. The website eventually said as much: they didn’t know why we weren’t receiving water, and they were trying to find the fault. This seemed to go on for an eternity, even though it was less than three days in total.

By Monday we had water, even though the fear was always there: would it go again? Would the pressure drop? The fault disappeared from Watercare’s website, we no longer saw the vans driving around, so it appeared the problem was fixed.

Even when the cyclone hit a week or so later we didn’t lose water, so they must have done something right; I just never knew what that was.

And all the time I kept thinking: they’re relying on us, the customers, to let them know that their core service (providing clean water) has failed.

The Andon Cord

A photograph of a person in a car manufacturing plant pulling on a yellow cord above their head

I’ve just finished reading The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win. If you’re working in software development then you’ve probably heard of it, but it was new to me. I still don’t know if I entirely understand DevOps after reading it, but there are some key takeaways that are relevant here.

Throughout the novel, the main character is guided by a sage-like figure who pops up from time to time to talk about how things like Lean Manufacturing techniques are relevant to IT work. A lot of these techniques were mastered at Toyota manufacturing plants, and one such technique is the concept of swarming.

This is where the Andon cord comes into play. In the appendix to my edition of The Phoenix Project there’s a preview of The DevOps Handbook, in which the role of the Andon cord is explained:

In a Toyota manufacturing plant, above every work center is a cord that every worker and manager is trained to pull when something goes wrong; for example, when a part is defective, when a required part is not available, or even when work takes longer than documented.

When the Andon cord is pulled, the team leader is alerted and immediately works to resolve the problem. If the problem cannot be resolved within a specified time (e.g., fifty-five seconds), the production line is halted so that the entire organisation can be mobilized to assist with problem resolution until a successful countermeasure has been developed.

Three reasons are listed for why it’s advantageous to halt work and swarm the problem:

  • It prevents the problem from progressing downstream

  • It prevents the work center from starting new work

  • If the problem is not addressed, the work center could potentially have the same problem in the next operation

All of this, the authors note, seems contrary to standard management practice: “we are deliberately allowing a local problem to disrupt operations globally.” But the crucial thing here is that swarming enables learning:

It prevents the loss of critical information due to fading memories or changing circumstances. This is especially critical in complex systems, where many problems occur because of some unexpected, idiosyncratic interaction of people, processes, products, places, and circumstances - as time passes, it becomes impossible to reconstruct exactly what was going on when the problem occurred.

In The Phoenix Project, a lot of IT problems are solved by one person: Brent. This means that solutions aren’t standardised, and so any sufficiently complex IT problem is Brent’s problem. Once the main character recognises this, he begins to bring in other engineers whenever a problem needs Brent: they swarm around the problem, learn how it’s solved, and Brent can focus on the work where he’s genuinely needed instead of putting out all of the fires.

Swarm around the problem. Share the solution.

Dear Mr Jane Doe

In one of my previous jobs at a university in England, we would send out an alumni e-newsletter every month. And so every month we would pull a list of emails and other personal information from our CRM: names, salutations, year of graduation etc.

Every time we did this, without fail, we encountered the same problems: we found that some of the salutation fields weren’t quite right; we found that we’d pulled out the details of a married couple who were both our alumni, but we knew they preferred a single e-newsletter sent to the wife’s address; we found that there was an obvious typo in a group of email addresses that had been imported into the database together. And on, and on, and on…

We usually couldn’t pull the data out of the CRM until hours before the e-newsletter was due to go out, because our colleagues wanted to ensure we were using the most up-to-date information. This meant that any corrective actions we needed to take on the data were done in haste, which was stressful and meant we risked missing something.

We got so used to this process that we even had a list of corrections we needed to make every time: change Mr Jane Doe to Ms Jane Doe, change jane.doe@tmail.com to jane.doe@gmail.com. We never looked forward to doing this work, and it always took hours and hours to get it all done; if that meant delaying the e-newsletter, then that’s the way it had to be. After all, we couldn’t compromise on data quality.
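
Automated or not, that correction list is the anti-pattern in miniature. If we had ever scripted it, it might have looked something like the sketch below; every name, email address, column, and file layout here is invented for illustration, not taken from the real CRM. Notice where the problem lives: the lookup tables only ever grow, because the records in the database are never corrected.

```python
# Hypothetical sketch of the monthly downstream patch we were applying by hand.
# All names, addresses, and column names are made up for illustration.
import csv

# These lists never shrink, because the source data is never fixed.
SALUTATION_FIXES = {"Mr Jane Doe": "Ms Jane Doe"}
EMAIL_FIXES = {"jane.doe@tmail.com": "jane.doe@gmail.com"}

def patch_export(in_path: str, out_path: str) -> None:
    """Apply the same corrections to every monthly CRM export."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["salutation"] = SALUTATION_FIXES.get(row["salutation"], row["salutation"])
            row["email"] = EMAIL_FIXES.get(row["email"], row["email"])
            writer.writerow(row)
```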

The solution is obvious, isn’t it? Correct the information in the database!

We weren’t idiots; we knew this. But the moment we finished sending off that corrected data for the monthly e-newsletter, it was on to another urgent task. And then another one. And then, before you knew it, a month had passed and it was time for the next e-newsletter.

We were all busy all the time, but looking back there was no excuse for it: by fixing the same problem every time at a downstream point, we risked passing problems even further downstream (more than one alumnus was addressed incorrectly, or congratulated on ten years since graduating despite having graduated the year before), it tied us up with manual data corrections that went on for hours, and it meant that we were fixing the same problem every month.

We should have known better, but what I’ve described isn’t an unusual way of preparing data lists. For all I know that alumni office in that university in England is still preparing data for its monthly alumni e-newsletters in the same way.

Here’s another example of a similar problem.

When I recently noticed that one of our key metrics was trending higher than usual, I immediately started crafting a positive message for the month-end report: we were finally recovering after the successive COVID-induced lockdowns!

I was only a few words in when I decided to dig a little deeper into the numbers, and I soon realised that the massive increase was due to just one facility. Except, there was no way that facility could produce those sorts of numbers.

Sure enough, when I pointed this out to my colleague, he told me it was a data error due to the Christmas holiday period. He fixed it, refreshed the data, and things returned to normal.

Later, I asked him: what did you do to fix it?

It turns out the data is recorded by staff at the facility in question within an Excel spreadsheet, and a complicated series of formulas in the spreadsheet calculates the metric we need, which eventually makes its way into our data warehouse and a Power BI report where I got the number.

Even though the formula has been fixed in the spreadsheet, there’s no guarantee we won’t get incorrect numbers in future. One mitigation strategy moving forward might be for my colleague to check that Power BI report every month for errors, and return to the spreadsheet to apply any fixes necessary.

Correcting the same errors every month. Are you having déjà vu here?

Here, we’re finding an error downstream and applying a patch upstream. That’s problematic in two ways: we’re leaving it too late to see the problem, and rather than eliminating the problem we’re merely patching it over.
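
One way to at least catch this kind of problem earlier is a sanity check at the point where the facility’s numbers land in the warehouse, before they ever reach the Power BI report. The sketch below is illustrative only: the column names, table shape, and spike threshold are assumptions rather than anything from our actual pipeline, and it only flags the symptom sooner; fixing the spreadsheet at source is still the real cure.

```python
# Illustrative sketch (assumed column names and thresholds): flag implausible
# facility metrics as they are loaded, instead of waiting for someone to
# notice them in a month-end Power BI report.
import pandas as pd

def flag_suspect_metrics(df: pd.DataFrame, max_ratio: float = 3.0) -> pd.DataFrame:
    """Return rows where a facility's monthly value exceeds max_ratio times its
    trailing median -- a crude proxy for 'no way that facility could produce
    those sorts of numbers'."""
    df = df.sort_values(["facility", "month"])
    baseline = (
        df.groupby("facility")["metric_value"]
          .transform(lambda s: s.rolling(12, min_periods=3).median().shift(1))
    )
    return df[df["metric_value"] > max_ratio * baseline]

# During the load job, a non-empty result would halt the load and alert us:
# suspects = flag_suspect_metrics(monthly_facility_metrics)
# if not suspects.empty:
#     raise ValueError(f"Suspect facility metrics:\n{suspects}")
```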

I was reminded of my water problems when this issue came up.

Prepare for unforeseen consequences

Watercare found out about the problem when a bunch of residents reported a water outage via their website, and even though they restored our water they didn’t provide any guarantees to their customers that the problem had been eliminated.

I’m not remotely qualified to comment on whether or not Watercare could have done more to prevent the water outages in our neighbourhood after the 27 January floods, but during the subsequent three days I couldn’t help thinking about the analogy with data engineering.

When things go wrong with our data pipelines, we should know about it early: that’s why we implement data observability, and why we write unit tests for the code that acts on our data. This not only alerts us to a problem the moment it arises, but it pinpoints the source of the problem; it could even indicate a weakness in our pipeline before it becomes critical.
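
As a concrete (and deliberately simplified) illustration, here’s the kind of unit test that might guard a single pipeline step; the metric function and its numbers are hypothetical, not taken from the systems described above. If the calculation breaks the way the spreadsheet formula did, the pipeline fails immediately, at the point of the fault, rather than a month later in a report.

```python
# Hypothetical example: a unit test guarding a metric calculation in a pipeline.
# The function and its thresholds are invented for illustration.
import pytest

def monthly_utilisation(hours_used: float, hours_available: float) -> float:
    """Toy version of a metric calculation that lives inside a pipeline."""
    if hours_available <= 0:
        raise ValueError("hours_available must be positive")
    return hours_used / hours_available

def test_utilisation_is_a_sensible_proportion():
    assert monthly_utilisation(80, 160) == pytest.approx(0.5)

def test_utilisation_rejects_impossible_inputs():
    with pytest.raises(ValueError):
        monthly_utilisation(80, 0)
```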

In my work contexts, we could see the effects of the problem (albeit by chance in the second case), but we were just patching it over rather than eliminating it entirely: the alumni e-newsletter data still needed fixing every time, and my colleague’s spreadsheet could easily break again.

This is why data observability and unit tests are only worthwhile if we address the root of the problem, and why I still felt uneasy even after the water came out of my taps again: was this a quick fix, or had the problem been eliminated?

Things will go wrong, and you should always expect a failure to occur somewhere. But if you take just three things from all of this, here they are:

  • Monitor your system from end to end so that you’re alerted the moment something goes wrong, and know exactly where it went wrong.

  • Address the problem at source, and don’t just patch it over: eliminate the problem so it doesn’t reoccur.

  • Swarm the problem: don’t let work pile up until the problem has been addressed, and make sure your colleagues (or customers) know what you did to fix it.

As was once said in a more threatening context than I intend here: prepare for unforeseen consequences. They will happen, and you (and those around you) will be better off if you know when, where, and how they occurred.
