How DAGs grow: Deeper, wider, and thicker
Introductory note: this blog post is a refresh of a talk that James and I gave at Strata back in 2017. Why recap a 3-year-old conference talk? Well, the core ideas have aged well, we’ve never actually put them into writing before, and we’ve learned some new things in the meantime. Enjoy!
November 16, 2020
After working in data science and engineering for many years, we’ve observed a common pattern. The effort level required to maintain data systems typically looks like this:
There are three phases. First, when you start building a data system, there’s an up-front cost to figure out how it should all work, fit it together, and build it.
Second, for a while it just works. We call this phase the “miracle of software.”
Third, after a while, you start noticing that you’re spending a lot of time on maintenance: fixing bugs, doing root cause analysis to figure out why things are breaking in your data pipelines, and making additions and adjustments as other systems appear or change.
Over time, this turns into a steady, compounding creep in the maintenance time that you’re putting into the project. If that gets too far out of control, the work just stops being fun. Stuff breaks unpredictably, timelines become highly variable, and people burn out.
We wanted to articulate why this happens, to understand how to prevent it. In the end, we arrived at this mental model of the core dynamics.
Why compound growth?
To understand why the cost of maintenance compounds over time, we need to break down how data scientists, analysts, and engineers really spend their time. At a high level, most data teams spend their time asking questions of their data and building infrastructure to make the resulting analysis repeatable. Since we’re interested in the cost of maintaining data systems, we’re going to focus on the repeatable aspects.
When data analysts, scientists, or engineers build repeatable pipelines, they build DAGs.
Here’s a stylized picture of a DAG. For the sake of this example, we’ll call it an ETL/ELT project. But it could just as easily be a machine learning pipeline, a set of BI aggregations, or an operational dashboard. In data work, everything is a DAG.
Returning to our ELT example: say we ingest three tables from an upstream data source. We then need to munge and clean the data. This likely includes things like small joins and filtering.
Let’s say that the end result of that pipeline is two cleaned tables, derived from the three tables of messy, external data that we started with.
Boom. Functioning data pipeline.
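To make the picture concrete, here is a minimal sketch of that pipeline as a plain adjacency list. The table names and the exact wiring between raw and cleaned tables are assumptions for illustration; the post only says that three ingested tables become two cleaned ones.

```python
# Hypothetical table names. Each key is a node; its list holds the
# downstream nodes that read from it.
elt_dag = {
    "raw_a": ["clean_1"],
    "raw_b": ["clean_1", "clean_2"],  # assumed: one raw table feeds both cleaned tables
    "raw_c": ["clean_2"],
    "clean_1": [],
    "clean_2": [],
}

def count_nodes_and_edges(dag):
    """Return (number of nodes, number of edges) for an adjacency-list DAG."""
    return len(dag), sum(len(children) for children in dag.values())

print(count_nodes_and_edges(elt_dag))  # (5, 4)
```

Five nodes and four edges is small enough to hold in your head, which is part of why this stage feels easy.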
Deeper, wider, and thicker
What happens next? Answer: the DAG grows.
When people trust a source of information, they tend to ask more questions of it. When the same questions have repeated value, their supporting DAGs tend to grow, in three specific ways.
First, they grow deeper. As soon as we’ve got that nice clean, normalized data, someone is going to ask to see it. In the simplest case, this could be a one-off query. If it’s made repeatable, those queries probably become dashboards or reports.
For the sake of this example, let’s say you end up building four dashboards on top of the initial cleaned table.
Second, data systems grow by getting wider: by adding more data sources. Of course, each of these will need its own ELT as well. And then new data products will be built on top of those nodes in the DAG.
For our example, let’s say that this leads to two new ingested sources and two normalized tables. On top of that, we build an alerting system with three types of alerts at the bottom of the DAG.
The third way that data systems grow is by getting thicker. In our example, it won’t be long before users start asking to add alerts that use some of the same data as the dashboards, and add dashboards that report on the alerting system.
Saying that DAGs “grow thicker” is the same as saying that they become more interconnected. When you map this out visually for actual DAGs, they usually turn into messy-looking hairballs, fast.
The thing is, that messy-looking interconnectedness is often the whole point of the data system. Breaking down information silos, sharing data, creating more contextually aware decision support systems: this is the work that data teams are paid to do.
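The three kinds of growth can be tallied in a small sketch. The node names and the specific edges are assumptions chosen to match the counts in the running example, and the "thicker" step is simplified to adding cross-connections only, without new nodes.

```python
def add_edges(dag, edges):
    """Add (upstream, downstream) edges in place, creating nodes as needed."""
    for upstream, downstream in edges:
        dag.setdefault(upstream, set()).add(downstream)
        dag.setdefault(downstream, set())

def size(dag):
    """Return (number of nodes, number of edges)."""
    return len(dag), sum(len(children) for children in dag.values())

dag = {}
growth = []

# Starting point: three raw tables feed two cleaned tables.
add_edges(dag, [("raw_a", "clean_1"), ("raw_b", "clean_1"),
                ("raw_b", "clean_2"), ("raw_c", "clean_2")])
growth.append(("initial", size(dag)))

# Deeper: four dashboards built on a cleaned table.
add_edges(dag, [("clean_1", f"dash_{i}") for i in range(1, 5)])
growth.append(("deeper", size(dag)))

# Wider: two new sources, two normalized tables, three alerts.
add_edges(dag, [("raw_d", "clean_3"), ("raw_e", "clean_4"),
                ("clean_3", "alert_1"), ("clean_3", "alert_2"),
                ("clean_4", "alert_3")])
growth.append(("wider", size(dag)))

# Thicker: same nodes, more cross-connections between them.
add_edges(dag, [("clean_1", "alert_1"), ("clean_2", "alert_3"),
                ("clean_3", "dash_1"), ("clean_4", "dash_2")])
growth.append(("thicker", size(dag)))

for stage, (nodes, edges) in growth:
    print(stage, nodes, edges)
# initial 5 4
# deeper 9 8
# wider 16 13
# thicker 16 17
```

In this toy version the last step adds no nodes at all, yet the edge count keeps climbing. That is the "thicker" dimension in isolation.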
We’ve now sketched out most of the main causes for compounding maintenance burden in data pipelines. There’s one last piece: downstream consequences.
Because data flows through the DAG, changes in the upstream DAG can affect the behavior of the downstream DAG. Even seemingly small changes can have large, unexpected consequences.
For example, changing an upstream table to allow null values could change the denominator in important downstream calculations. Perversely, adding null values might not have any impact on the numerator, which means that many reports based on exactly the same tables and columns might not be affected.
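A tiny worked example of the null-value case, using made-up rows and two hypothetical reports. Both reports read the same column, but only the one that divides by the row count shifts when a null row appears.

```python
# Hypothetical rows before and after the upstream table starts allowing nulls.
rows_before = [{"user": "a", "spend": 10.0}, {"user": "b", "spend": 30.0}]
rows_after = rows_before + [{"user": "c", "spend": None}]

def total_spend(rows):
    """Report 1: a sum that skips nulls, so the new row has no effect."""
    return sum(r["spend"] for r in rows if r["spend"] is not None)

def avg_spend_per_row(rows):
    """Report 2: same numerator, but the denominator counts every row."""
    return total_spend(rows) / len(rows)

print(total_spend(rows_before), total_spend(rows_after))  # 40.0 40.0
print(avg_spend_per_row(rows_before))  # 20.0
print(avg_spend_per_row(rows_after))   # roughly 13.33
```

Nothing errors and nothing looks obviously wrong in either report, which is what makes this kind of change hard to catch.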
Another example: suppose an upstream logging team changes the enumerated values assigned to certain types of logs. If those values were being used to trigger alerts and the alerting system isn’t updated at the same time, you might suddenly find yourself responding to lots of false alarms.
More subtly, if the values were being used as inputs in a machine learning model, the model might silently start generating bad predictions. Depending on the type of model and how much weight was assigned to the affected variables, the impact on predictive accuracy could be large or small, widespread or concentrated in a few cases.
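The false-alarm scenario can be sketched in a few lines. The severity codes, the threshold rule, and the sample logs are all assumptions for illustration, not a description of any particular logging system.

```python
# Hypothetical severity codes before and after an upstream change that
# inserts a DEBUG level and shifts every other code up by one.
OLD_CODES = {"INFO": 1, "WARN": 2, "ERROR": 3}
NEW_CODES = {"DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4}

ALERT_THRESHOLD = 3  # written when 3 meant ERROR, and never updated

def should_alert(code):
    """Downstream alert rule: fire on any code at or above the threshold."""
    return code >= ALERT_THRESHOLD

logs = ["INFO", "WARN", "WARN", "ERROR"]
alerts_before = sum(should_alert(OLD_CODES[level]) for level in logs)
alerts_after = sum(should_alert(NEW_CODES[level]) for level in logs)
print(alerts_before, alerts_after)  # 1 3
```

The same four log lines go from one alert to three, even though nobody touched the alerting code.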
All of these types of problems have a few things in common:
- Upstream changes can have unexpected downstream consequences.
- Unintended consequences don’t necessarily show up in nodes immediately adjacent to where the change was made. They can skip levels and show up much deeper in the DAG.
- Unintended consequences can cascade to cause additional havoc further downstream in your DAG.
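One way to see how far a change could travel is to walk the DAG from the changed node and record how many hops away each downstream node sits. This is a minimal breadth-first sketch over a hypothetical DAG. It shows which nodes could be affected, not which ones actually will be.

```python
from collections import deque

def downstream_distances(dag, changed):
    """Map every node reachable from `changed` to its hop count (breadth-first)."""
    distances = {}
    queue = deque([(changed, 0)])
    while queue:
        node, hops = queue.popleft()
        for child in dag.get(node, ()):
            if child not in distances:
                distances[child] = hops + 1
                queue.append((child, hops + 1))
    return distances

# Hypothetical DAG: nodes missing from the dict have no downstream readers.
dag = {
    "raw_b": ["clean_1", "clean_2"],
    "clean_1": ["dash_1", "alert_1"],
    "clean_2": ["dash_2"],
    "alert_1": ["dash_alerts"],
}

affected = downstream_distances(dag, "raw_b")
print(affected)
# {'clean_1': 1, 'clean_2': 1, 'dash_1': 2, 'alert_1': 2, 'dash_2': 2, 'dash_alerts': 3}
```

A change to one raw table puts six nodes in the blast radius, and only two of them are adjacent to the change.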
A quasi-proof for compounding maintenance cost
Putting all of this together, we can start to understand why the cost of maintenance compounds.
- The cost of maintenance is directly tied to the probability of downstream consequences of upstream changes in the DAG.
- The probability of unintended consequences is a joint function of the number of nodes in the DAG, the density of edges in the DAG, and the frequency of changes in the upstream DAG.
These factors tend to increase as DAGs grow, which is why the probability of downstream consequences and the cost of maintenance increase as a super-linear function of the size of the DAG.
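A toy model gives a feel for the super-linear claim. Assume, purely for illustration, a layered DAG in which every node feeds every node in the next layer, and count the (upstream, downstream) pairs where a change at the first node could reach the second. Real DAGs are sparser, so treat this as an upper-end sketch rather than a measurement.

```python
def layered_dag(depth, width):
    """Toy DAG: `depth` layers of `width` nodes, each feeding all nodes in the next layer."""
    dag = {}
    for layer in range(depth):
        for i in range(width):
            if layer + 1 < depth:
                dag[(layer, i)] = [(layer + 1, j) for j in range(width)]
            else:
                dag[(layer, i)] = []
    return dag

def reachable_pairs(dag):
    """Count (upstream, downstream) pairs connected by some path."""
    total = 0
    for start in dag:
        seen = set()
        stack = list(dag[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(dag[node])
        total += len(seen)
    return total

for depth in (2, 4, 8):
    dag = layered_dag(depth, width=3)
    print(len(dag), reachable_pairs(dag))
# 6 9
# 12 54
# 24 252
```

Each doubling of the node count multiplies the number of exposed pairs by far more than two. In this model the count is width² × depth × (depth − 1) / 2, which is quadratic in depth.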
Okay, this is a good breaking point. In a follow-up article, I’ll get into more details about how DAG maintenance plays out in practice, especially when data flows cross team boundaries and downstream consequences of changing DAGs surface at unexpected moments.