Pipeline debt: a universal experience
Every data engineer has had experiences like these:
A critical dashboard is on the fritz. Several metrics are broken … well, sometimes they’re broken. It started happening sometime in the last month, but no one noticed until yesterday. Your CMO and the head of product don’t know what to believe and are flipping out. They need these numbers to be reliable by Monday’s board meeting.
You start looking into it, but the underlying data architecture is reminiscent of a Jenga tower, and no one who worked on this dashboard wrote anything down. You cancel your weekend plans.
You’re pretty sure your model is broken in production. Everything seems fine in the development notebooks, but something happened during handoff.
You can speculate—maybe a feature isn’t being computed correctly? The input data has drifted? Somebody dropped a minus sign?—but all you can really be sure of is that something about the results is off. Oh, and that digging in and fixing it is going to take way more time than you have.
Everyone seems to have their own way of calculating (what should be) the same metrics. You didn’t have this problem two years ago when your system was brand new and you only had one data team. But since then, you’ve added two teams, eight engineers, and a mycelial network of interdependent tables.
The system needs a refactor, but the job seems thankless, and the potential scope is terrifying.
All of these problems are flavors of pipeline debt. And every flavor of pipeline debt tastes awful.
What is pipeline debt?
Pipeline debt is a species of technical debt: think Michael Feathers’s concept of legacy code, just one step to the left.
Pipeline debt accumulates when the assumptions that justified the movement of data throughout an organization get lost. The data moves, but no one knows exactly why anything is set up like it is. So when the data stops moving correctly or the results start looking weird, it’s a Problem.
Why is there pipeline debt?
It’s easy for organizations to accumulate pipeline debt, even when their engineering teams are very competent. There are two big factors that cause this.
One: data systems naturally evolve to become more interconnected over time. It’s generally a given that “breaking down data silos” is a good thing—and it is—but in a DAG context, it’s easy to skip over thinking through what the inverse of a silo actually is.
There’s a reason that actual agricultural silos exist, after all, and it’s not because leaving piles of grain around was too easy.
Without a silo, your pipelines want to be hairballs. And since the point of breaking down the silo was for everyone to get data from everywhere, your PMs, executives, and customers are going to help make that hairball happen. And so are you.
Two: data pipelines cross team borders. Any time a new model, dashboard, or metric is handed off between roles or teams, there’s an opportunity for pipeline debt to creep in.
No two roles or teams have exactly the same perspectives, tools, and skill sets. Those differences introduce little continuity gaps. And those gaps are where the misunderstandings and misinterpretations that feed pipeline debt sprout.
Inside a team, pipeline debt can be introduced by something as simple as an analyst-to-engineer handoff. On a larger scale, pipeline debt can become especially acute when upstream and downstream systems are owned by different teams, as in the common case where the team that owns logging is different from the teams that own analysis and model building.
What makes pipeline debt different from other technical debt?
Debt in data pipelines is different from the kind of debt that accumulates in other software systems, especially when machine learning is in the picture—which is the case with increasing frequency.
Here are three issues that make data pipelines different from other systems:
Most pipeline complexity lives in the data the pipeline carries, not in the structure of the pipeline itself. That ‘most’ is because the structure of a DAG can introduce some complexity. But often, a pipeline consists of relatively simple code processing huge amounts of highly variable data that’s full of edge cases. That’s just as true for basic ELT pipelines as it is for deep neural nets.
Data pipelines are full of bugs with soft edges. Data insight is all about statistics, and statistics always have a margin of error, which makes statistical models notoriously hard to test. And practically speaking, it’s often not worth debugging every single possible Unicode error, so we soften up even supposedly deterministic pipelines to ignore a small number of edge cases. And then we don’t document what ‘small’ is.
Insights get left on the cutting room floor. A data analyst or scientist who’s exploring new data develops a mental model of the shape of the dataset and the context in which it was created. Those mental models are semantically rich and provide a lot of nuance about the data’s real-world relevance and true meaning. Unfortunately, that semantic richness is almost always lost in translation as the data work moves farther from its originator. Highly personalized understandings of the data end up automated as brittle pipelines that are divorced from the contextual reality of the data they carry.
These factors, plus the wealth of opportunities to introduce technical debt into a pipeline, mean that pipeline debt is absolutely rampant. Like pollution in a river, it accumulates slowly but surely until the entire thing stinks.
What does pipeline debt cause?
Pipeline debt isn’t just an annoyance for technical teams. It introduces both productivity costs and operational risks.
On the productivity front, pipeline debt manifests as a proliferation of unintended consequences, which gives anything involving data a slow, unpredictable development cadence. This is a huge logistical drag on agile (and Agile) teams scrambling to respond to increasing pressure for high-quality insights and demand for new AI/ML tools. And just as importantly, it’s an emotional drag on those same teams: uncertain and frustrated, data engineers lose speed and the work stops being fun.
If you’re not a data engineer, ‘data work not being fun’ might not seem like a real problem. But effective innovation with data, particularly in the fast-evolving AI and ML spaces, requires a lot of creative thinking. Unhappy data engineers just aren’t going to be as effective as happy ones would be.
(Data engineers are, of course, by no means unique in producing better work when happy.)
Operationally, debt-heavy data systems emit a long-tailed stream of glitches. Severity ranges from ‘annoying’ to ‘catastrophic,’ but all of them erode trust, sometimes to the point of putting the system’s core usefulness in doubt. If you can’t trust your dashboard, report, or prediction, why do you even have it?
The solution to pipeline debt
The solution to general technical debt isn’t a secret: it’s automated testing. Testing builds self-awareness of the system by systematically surfacing errors. Virtually all modern software teams rely heavily on automated testing to manage complexity.
Except for the teams that build data pipelines.
They are testing their code. But as we described before, the complexity in data pipelines is mostly not from the code. So the same testing practices don’t do as much for pipelines as they do for other software systems.
Test the pipeline’s data, not just its structure
For data pipelines, we need to test the place the complexity lives, which is in the data.
And we need to test at the point when new complexity might be introduced. That mostly isn’t when new code is deployed; it’s when new data arrives.
A single piece of new data doesn’t fuel the fire of pipeline debt. It’s the totality of the data that arrives—the batch—that starts causing problems.
Our term for a test that runs on a newly arrived batch of data is a pipeline test.
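To make the idea concrete, here’s a minimal sketch of a pipeline test in plain Python with pandas (not GX itself). The column names, thresholds, and the `validate_batch` helper are all hypothetical, chosen only for illustration:

```python
import pandas as pd

def validate_batch(batch: pd.DataFrame) -> list[str]:
    """Run simple pipeline tests against a newly arrived batch of data.

    Returns a list of human-readable failure messages; an empty list
    means the batch passed. Column names and thresholds are illustrative.
    """
    if batch.empty:
        return ["batch is empty"]

    failures = []
    # Key columns should never be null
    if batch["user_id"].isnull().any():
        failures.append("user_id contains nulls")
    # Values should fall in a plausible range
    if not batch["amount"].between(0, 1_000_000).all():
        failures.append("amount outside expected range")
    return failures

# A batch with one bad row: the missing user_id is caught at arrival time,
# before it flows downstream into dashboards or models.
batch = pd.DataFrame({"user_id": [1, 2, None], "amount": [10.0, 25.5, 99.0]})
print(validate_batch(batch))
```

The key design point is where the check runs: on the batch, at arrival, rather than in a unit test that only exercises the pipeline’s code.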
We didn’t wholesale invent this concept. Most data teams naturally evolve some defensive data practices to manage the chronic pain of pipeline debt.
These are all examples of informal pipeline test practices:
“I always flush new data drops through `DataFrame.describe` before running them through our ETL.”
“We’ve started to get much more disciplined about enforcing typed data, especially on feeds that get shared across teams.”
“Every time I process a new batch, I run a series of row counts on the new tables. It’s the simplest possible test, but you’d be surprised how many bugs I’ve caught.”
“We maintain a team of data quality analysts and impose strict SLAs on internal data products.”
These ad hoc practices are good ideas, but they aren’t standardized. This means they aren’t easily transferable between teams or organizations, and communicating about them requires a lot of effort and documentation.
Why GX does it best
We built Great Expectations to create a standard for pipeline testing—one that’s shared, effective, open, and portable:
Shared: Using GX creates a shared frame of reference and a shared toolbox for implementation, so any team using GX can easily understand what another team using GX is doing.
Effective: By focusing on the data in the pipeline rather than the structure of the pipeline, GX tests the real source of the complexity that’s at the root of the pipeline’s debt.
Open: As an open source tool, GX’s transparency in its operations and results means that it doesn’t add another layer of mystery on top of your already-troublesome pipelines.
Portable: GX’s low barrier to entry means that the majority of organizations have the technical requirements to use it right now without adding any new technology. Data engineers can bring GX with them between jobs, allowing them to build long-term expertise that benefits every organization they work with.
Collaboration is the key
One thing that’s been made very clear to us over the past six years of building GX is that data quality isn’t a purely technical problem. If it were, it would have been solved by now.
Data quality’s real challenge is that it’s a collaboration problem. Data professionals face challenges in communicating with other data teams and nontechnical teams alike, though not the same challenges.
At GX, we’re focused on solving data quality in a way that fosters better communication—and therefore collaboration—for data teams. Improving the speed and integrity of data collaboration is what will allow teams to truly solve data quality problems.
Since we first wrote about pipeline debt five years ago, the GX platform has grown by leaps and bounds in pursuit of better data collaboration. It has hundreds of contributors and millions of PyPI downloads, and we’ve recently introduced the GX Cloud Beta to complement GX Open Source.
On the people side, our Slack community has more than 10,000 members, with more joining every day. GX has grown from a side project into a company in its own right, with more than 50 employees.
In other words: we’ve proven that GX is on the right path, though we’re by no means done. Check it out, drop in on our Slack, or join one of our community meetups to see how we’re working to revolutionize data quality.
This post is an update of 2018's Down with pipeline debt / introducing Great Expectations.