Data teams are usually first and foremost responsible for maintaining existing data pipelines and building new ones. Let’s talk about the lifecycle of a data pipeline before it breaks a dashboard.
Data pipelines: growing complexity over time
First, it’s all exciting and fun in the development phase: figuring out what it is the stakeholder wants, identifying data sources that are going to be useful for the analysis, and finally solving the problem that needs solving. This is engineering: making choices in architecture and implementing them.
After the pipeline is deployed, it usually works just fine for some time. However, upstream changes are unavoidable. For instance, the product team could change the user application flow, which in turn affects the funnel analysis dashboard. The logging team could change the data format of a table used in several queries. Or, even worse, Facebook could change the format of the ad data that your ETL parses into the data warehouse.
All of these changes require the analytics team to maintain the pipeline. With rising data complexity, more and more time is spent finding the issue rather than actually fixing the pipeline.
This accumulation of misunderstood assumptions in code is pipeline debt.
Let’s talk about the analyst from Part 1 who found themselves debugging a dashboard for the VP of Marketing. If they wrote the pipeline, the job is easier. However, if the original pipeline was written by another analyst or another team, before they can debug the problem they must first take time to understand the pipeline and how it’s supposed to work.
Complexity Carrying Capacity defined
A team has fixed resources. A fixed number of data analysts, analytics engineers, and data engineers with a fixed number of hours in each day.
At a single point in time, the team also has a fixed number of pipelines. However, as discussed above, a single pipeline’s complexity is not fixed over time. With growing complexity, more and more of the team’s resources are needed to maintain the single pipeline.
With fixed resources but growing time needed for maintenance, a point is naturally reached where everyone is spending all of their time maintaining the system without any time left to build out new features.
Defining complexity carrying capacity: the maximum amount of existing pipeline complexity an analytics team can support with its fixed resources.
The goal is to move this point as far out as possible and keep it in check. The further out the complexity carrying capacity is, the more the same team can accomplish.
An efficient analytics team is a team that has maximized their resources by maximizing their complexity carrying capacity; in other words, an efficient team can maintain a large number of existing pipelines while continuing to build out new features.
You might be wondering: for a given team size, is this point static? No, it definitely isn’t! The equation consists of three parts:
[team resources] x ([maintenance] + [new features]) = [complexity carrying capacity]
Team resources aren’t changing (well, of course you can hire, but that doesn’t solve the underlying issue of an inefficient team). However, maintenance time can definitely be influenced. Imagine if our poor analyst could find the root cause of the broken dashboard within a few minutes instead of several hours?
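To make the trade-off concrete, here is a toy calculation with entirely hypothetical numbers (team size, hours, and maintenance load are all made up): because the team's hours are fixed, every hour spent on maintenance is an hour taken away from new features.

```python
# Toy illustration with made-up numbers: a team's weekly hours are fixed,
# so every hour of maintenance is an hour not spent on new features.
TEAM_SIZE = 5          # analysts (hypothetical)
HOURS_PER_WEEK = 40    # working hours per analyst (hypothetical)

def feature_hours(maintenance_hours: float) -> float:
    """Hours left for new feature work after maintenance is paid."""
    total = TEAM_SIZE * HOURS_PER_WEEK
    return max(total - maintenance_hours, 0)

print(feature_hours(50))   # healthy pipelines: 150 hours left for new work
print(feature_hours(200))  # debt-ridden pipelines: 0 hours left
```

Once maintenance consumes the whole budget, the team has hit its complexity carrying capacity: any additional pipeline debt comes directly out of reliability or working hours.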
How can reducing pipeline debt help?
In the existing system, they can’t, because of (you guessed it) accumulated pipeline debt. Ask any data engineer, and they’ll surely tell you pipeline debt grows exponentially, not linearly. Unmanageable pipeline debt directly decreases a team’s complexity carrying capacity, and decreases it fast.
Let’s continue to drive one point home: complexity will grow over time as the data itself gets more complex, and this is completely natural.
However, there are different ways of handling this growing web of data. Accepting the exponentially growing maintenance time that comes with growing complexity is neither scalable nor a path towards an efficient analytics team. Instead, let’s focus on what we do have control over: the factors that drive maintenance time.
Once an issue is found, the time to implement a fix is not something that can be easily controlled. If an upstream third-party service changes its data format to nested JSON, it just is what it is. However, finding the issue happens within code that is internal to the organization.
Just like software engineers should write unit tests, analytics teams should write pipeline tests.
Testing data upon each pipeline execution ensures the output meets certain expectations. This approach documents the data format and shows exactly where a failure arises. Imagine if our analyst, the one performing root cause analysis on the broken dashboard, could point to a single failing test? They would save hours of investigation time.
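Tools like Great Expectations formalize this idea; as a minimal hand-rolled sketch of the same pattern, a pipeline can run a few expectations against its output before publishing downstream. The column names (`user_id`, `conversion_rate`) and the expectations themselves are hypothetical, chosen only for illustration:

```python
# Minimal hand-rolled pipeline test, in the spirit of tools like
# Great Expectations: check expectations about the output before
# publishing it downstream. Column names here are hypothetical.

def check_funnel_output(rows: list[dict]) -> list[str]:
    """Return a list of failed expectations (empty list = all passed)."""
    failures = []
    if not rows:
        failures.append("output is empty")
        return failures
    if any(r.get("user_id") is None for r in rows):
        failures.append("user_id contains nulls")
    if any(not 0 <= r.get("conversion_rate", -1) <= 1 for r in rows):
        failures.append("conversion_rate outside [0, 1]")
    return failures

good = [{"user_id": 1, "conversion_rate": 0.42}]
bad = [{"user_id": None, "conversion_rate": 1.7}]
print(check_funnel_output(good))  # []
print(check_funnel_output(bad))   # both expectations fail, with names
```

When a run fails, the failing expectation names the broken column directly, so the debugging analyst starts from "conversion_rate outside [0, 1]" rather than from a blank dashboard.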
On why pipeline testing increases complexity carrying capacity and the how-to of implementation, continue to Part 3! Maximizing Productivity of Analytics Teams Part 3: Pipeline Tests & Great Expectations
- Pipelines will grow in complexity, and that’s completely natural.
- Exponentially growing maintenance time is not natural.
- Pipeline tests decrease pipeline debt by providing documentation and clear indication of a failed pipeline and downstream dependencies.
- Increasing complexity carrying capacity allows an analytics team to accomplish more with the same resources.
This blog is inspired by Strata 2018: Pipeline Testing with Great Expectations and Testing and Documenting Your Data Doesn't Have to Suck.
By Sarah Krasnik
Thanks for reading! I love talking data stacks. Shoot me a message.