Imagine this scenario: you know that pipeline tests on your data are absolutely crucial to your overall data quality. You’ve taken the steps to put those tests into place, and whether you’ve used an off-the-shelf tool, an open source option, or homegrown code, you’re ready to sit back, relax, and watch your pipelines deliver beautiful, high-quality data to your stakeholders.
You imagine your test foiling a dastardly batch of unsuitable data before it can infiltrate your dataset. And that’s when you realize you missed a step: what’s going to happen when your tests actually fail?
How are alerts going to come in? Are alerts going to come in? Is anyone monitoring the alerts? Who’s in charge of responding to alerts? How will you find out what went wrong? How do you fix the data issues that your pipeline tests find?
As exciting as it is to implement a suite of solid pipeline tests, a significant part of the art of data quality testing is figuring out how to effectively respond when you detect a data problem.
In this post, we’ll walk through some of the key stages of responding to data tests and outline the major considerations for developing your team’s data quality strategy.
The stages we’ll discuss are:
System response to failure
Logging and alerting
Root cause identification
Plus, we’ll talk about the importance of maintaining stakeholder communication throughout these stages.
System response to failure
The first line of response to a failed data test happens before any humans are notified, when the system’s automated responses kick in. These responses determine whether the pipeline run continues and typically take one of the following forms:
Do (almost) nothing: the system logs the failure or sends an alert, but the pipeline continues to run.
Isolate the problem: the system does something to quarantine the “bad” data, such as moving rows that fail tests to a separate table or file. The rest of the data continues normally through the pipeline.
Stop the pipeline.
It’s usually advisable to vary the system’s response depending on the severity of the detected issue and the downstream use case. Some types of issues might prompt a warning to stakeholders, but the pipeline can keep running; other critical errors could merit stopping the pipeline altogether.
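As a minimal sketch of how these graded responses might be wired together: the `Severity` levels, the `PipelineHalted` exception, and the `quarantine_sink` parameter below are hypothetical names for illustration, not any particular framework's API.

```python
from enum import Enum

class Severity(Enum):
    WARN = "warn"              # log the failure, keep running
    QUARANTINE = "quarantine"  # divert failing rows, keep running
    HALT = "halt"              # stop the pipeline entirely

class PipelineHalted(Exception):
    """Raised when a critical test failure should stop the run."""

def handle_failed_rows(rows, failed_mask, severity, quarantine_sink=None):
    """Apply the configured automated response to rows that failed a test."""
    bad = [r for r, failed in zip(rows, failed_mask) if failed]
    good = [r for r, failed in zip(rows, failed_mask) if not failed]
    if not bad:
        return rows
    if severity is Severity.WARN:
        # Do (almost) nothing: note the failure, pass everything through.
        print(f"WARNING: {len(bad)} rows failed validation; continuing")
        return rows
    if severity is Severity.QUARANTINE:
        # Isolate the problem: bad rows go to a side table/file.
        if quarantine_sink is not None:
            quarantine_sink.extend(bad)
        return good
    raise PipelineHalted(f"{len(bad)} rows failed a critical test")
```

The point of centralizing this in one handler is that the severity becomes configuration per test, rather than ad-hoc logic scattered across the pipeline.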
Logging and alerting
While it’s absolutely possible to simply write all your data validation results to a log and review them at your leisure, in our experience, you’ll almost always have tests that are critical enough for failures to require proactive alerting.
Some factors to consider when determining your approach to alerting are:
Which errors absolutely need alerting, and which can simply be logged? If stakeholders receive too many alerts for issues that turn out to be ignorable, they’ll begin ignoring all alerts. To avoid this alert fatigue, be conscientious about choosing which errors to alert on.
What medium are you using? After you’ve carefully chosen only the most critical errors for alerts, don’t undermine all your work by sending them to a busy Slack channel or someone’s inbox where they’re easy to overlook. Make sure your alerts are distinguished from regular status reports and other routine information so the recipients know what to look at. Using a tool like PagerDuty can help you fine-tune your alert management to the level of severity and responsiveness that each one needs.
How timely are your alerts? Are they sent on a fixed schedule, or immediately when a failure occurs? And if your alert mechanism itself fails, will anyone notice?
Alongside your plan for producing alerts, you need to figure out who’s going to see and respond to them. Some elements to take into account are:
Who gets notified and when? Of everyone that touches your data—upstream producers, downstream consumers, pipeline owners—who’s relevant to a particular alert?
Who’s responsible for acknowledging and investigating? Knowing something is up is different from doing something about it, and there needs to be one point of ownership for each alert. Different alerts might have different owners, but without a clear plan, issues can be overlooked or ignored, causing frustration to everyone.
Consider whether an on-call rotation can help with clearly assigning responsibility—often, it will. Having a designated point person responding to alerts creates a clear chain of responsibility. Importantly, don’t let the phrase “on-call” give you a false sense of urgency: in this scenario, it’s entirely possible that being on call takes place entirely within regular business hours. Don’t set anyone up to get 2 am phone calls unless you genuinely need that level of responsiveness.
Are your notifications informative? Do the recipients of your alerts know not just what’s happened but also what implications it has? It’s especially important to ensure that your data consumers understand the ramifications of an alert and know what steps to take to get more information about the problem or potential solutions. Hint: having a clear point of contact, like an on-call engineer, often helps with this!
Stakeholder communication
Now that you’ve found out about a test failure, it’s tempting to jump right into working on it and figuring out what’s going on. But it’s important to make sure that all the data’s stakeholders know something’s going on as soon as possible—ideally before they have a chance to see the effects in their own work.
This is pretty standard incident response stuff. But data quality problems often manifest in subtler ways than, for example, a web app bug causing a 404.
You’ll want to tune your communication to the severity of the issue and the stakeholders involved. Sometimes an automated alert might be good enough; other times, you might need a playbook where the team makes personal contact with affected groups.
It’s important to keep an open line of communication with your stakeholders so you can give them updates on the issue resolution process, answer any questions, and, if absolutely necessary, make ad-hoc fixes to address urgent data needs. (Since ad-hoc fixes add another opportunity for complexity in an area where you’re already having a problem, think twice—or even three or four times—before you decide to go there.)
Root cause identification
At a high level, data test failures fall into one of these buckets:
The data is actually fine: Sometimes, it’s the test that needs to be adjusted. This can happen when, for example, an unusual but correct outlier occurs for the first time since implementation.
The data is broken but fixable: A straightforward example of this is an incorrectly formatted date or phone number. The key to ‘fixability’ in this sense is that the data can be repaired within the scope of the current pipeline.
The data is broken and not fixable: An example of this situation is when the data is missing values. This category isn’t meant to express that the data is unfixable in a universal sense but that you’d have to go outside of the current pipeline to do it: for instance, by using a data validation service to fill in missing address fields.
Data issues that arise at the data ingestion/loading stages often stem from changes that the data team mostly doesn’t control. Some common examples include:
Delayed data deliveries leading to out-of-date data.
Table properties like column names or types changing unexpectedly.
Values and ranges diverging from what’s expected because of changes in how the data is generated.
Another major cause of data ingestion problems is issues with the actual ingestion runs or orchestration: when processes hang, crash, or get backed up due to long runtimes. These problems often manifest themselves as stale data.
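Since stale data is such a common symptom of hung or crashed ingestion runs, a freshness check is often worth automating. This sketch assumes your records carry a UTC timestamp and uses a hypothetical six-hour lag threshold that you'd tune to your actual delivery schedule:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(latest_record_ts, max_lag=timedelta(hours=6)):
    """Return True if the newest record is recent enough.

    A failing freshness check often points at the ingestion run itself
    (hung, crashed, backed up) rather than at the data's content.
    """
    now = datetime.now(timezone.utc)
    return (now - latest_record_ts) <= max_lag
```

Pairing a check like this with an orchestration-level "did the job even run?" check helps you tell a late delivery apart from a crashed process.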
So how do you identify the root cause of data ingestion issues? The key is to be methodical: first pin down the exact issue that’s occurring, and only then look for what’s causing it.
To be successful at this, you can’t take problems and test failures at face value. For example, a test for NULL values in a column could be failing because some rows actually do have NULL values … or because that column no longer exists. Those are two very different problems.
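To make that concrete, here's a small diagnostic sketch that separates those two problems before anyone starts fixing the wrong one. The function name and the list-of-dicts row representation are illustrative, not a specific library's API:

```python
def diagnose_null_failure(rows, column):
    """Distinguish 'the column has NULLs' from 'the column no longer exists'.

    `rows` is a list of dicts, as you might get from a database cursor.
    """
    if not rows or column not in rows[0]:
        # Schema change upstream: a very different problem (and fix)
        # than genuinely missing values.
        return "column_missing"
    null_count = sum(1 for r in rows if r[column] is None)
    if null_count:
        return f"{null_count} null values"
    return "ok"
```

A NULL-count test alone would report the same failure in both cases; a diagnostic step like this tells you which investigation to start.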
Once the problem is clear, then you can start investigating.
Of course, we can’t list every possible cause of a data quality problem in this post, but here are some good initial avenues to investigate:
Recent changes to ingestion code: Ask your team (or the team in charge of ingestion) if there have been any recent changes or updates, or go through your version control log.
Crashed processes or interrupted connections: Log files are usually helpful and the best place to start.
Delays in data delivery: In other words, is the problem in the system you’re ingesting from, or somewhere upstream of it? Check whether the system you’re ingesting from is receiving the source data on time.
Upstream data changes: Check the state of the source data, and get the data producers to confirm whether it’s as they expect (or not).
Finally, while data ingestion failures are often outside of your control, test failures on the transformed data are usually caused by changes to the transformation code. One way to head off these unexpected side effects is to enable data pipeline testing as part of your development and CI/CD processes.
By enabling engineers and data scientists to automatically test their code against, for example, a golden data set, you can decrease the likelihood that data-quality-problem-producing code will get into production.
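A golden-data test can be as simple as a table of known inputs and expected outputs that runs on every commit. The `normalize_phone` transformation and the `GOLDEN` pairs below are invented for illustration:

```python
# Hypothetical transformation under test.
def normalize_phone(raw):
    """Normalize a US phone number to E.164, or None if it isn't one."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return f"+1{digits}" if len(digits) == 10 else None

# Golden pairs: known inputs and their expected outputs, curated by the team.
GOLDEN = [
    ("(555) 123-4567", "+15551234567"),
    ("555-123-4567", "+15551234567"),
    ("12345", None),  # too short to be a valid number
]

def test_normalize_phone_against_golden():
    for raw, expected in GOLDEN:
        assert normalize_phone(raw) == expected, raw
```

Run under pytest (or any test runner) in CI, a change to the transformation that breaks a known-good case fails the build before it ever touches production data.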
So you’ve identified the problem … how do you fix it?
It won’t surprise you that there’s no single approach to fixes that can work with every root cause.
But consider the three buckets of test failures we discussed earlier. Each of these buckets has an associated type of fix that can help your test go green again:
The data is actually fine: Update your tests to account for your new knowledge.
The data is broken but fixable: You could try rerunning your pipelines, potentially with increased robustness toward disruptions like connection timeouts or resource constraints. You can also change (fix) your pipeline code; ideally, you can also add a mechanism that allows engineers to test their code so the same issue won’t happen again.
The data is broken and not fixable: You might need to connect with data producers and have them reissue the data if possible. There might also be situations where you need to isolate the “broken” records, datasets, or partitions until the issue is resolved. And, especially when you’re dealing with third-party data, you might have to stop using the data for good: the producer might delete, modify, or stop updating the data to the point where it’s simply no longer suitable for your use case.
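For the "broken but fixable" bucket, rerunning with more robustness often comes down to a retry wrapper around transient failures. Here's a minimal sketch with exponential backoff; the retryable exception types, attempt count, and delays are assumptions you'd tune for your environment:

```python
import time

def run_with_retries(step, max_attempts=3, base_delay=1.0,
                     retryable=(ConnectionError, TimeoutError)):
    """Re-run a pipeline step on transient failures, backing off between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to alerting
            # Exponential backoff: 1s, 2s, 4s, ... by default.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Note what's deliberately excluded from `retryable`: data-content failures should not be retried, since rerunning won't make bad data good.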
Now your tests pass, but …
Unfortunately, even when your tests are passing with flying colors, you need to consider that you might not be testing for everything. And it’s nearly impossible to write data tests for every single possible data problem before you encounter it for the first time.
That’s true for big, obvious factors too, not just very rare edge cases … as Abe Gong, Great Expectations’ CEO, can attest:
I once managed a daily ingestion pipeline that would alert if record counts dropped significantly from one day to the next, since that was usually our biggest concern. Little did I know that a bug in our pipeline would accidentally double the record counts, which, besides some “hmm, those pipelines are running very slow today” comments, aroused shockingly little suspicion—until a human actually looked at the resulting dashboards and noticed that our user count had skyrocketed that day.
So what can you do to make your tests more robust against these “unknown unknowns”? We’ve been working on this problem, and while we can’t say we’ve solved it yet, here’s what we recommend starting with.
Automated profilers: Use an automated profiler to generate data tests, specifically to increase test coverage in areas that might not be totally obvious to you. For example, you might not consider testing for the mean of a numeric column if you usually define it in terms of min and max—but additional tests could help you detect distribution shifts within that range. Use separate test suites and thoughtful alerting levels to make sure you don’t overwhelm yourself with notifications.
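You don't need a full-featured tool to see how profiling generates tests. The toy sketch below derives candidate bounds from a trusted sample, including mean bounds that can catch distribution shifts within the min/max range; it's an illustration of the idea, not any particular profiler's API:

```python
import statistics

def profile_numeric_column(values):
    """Derive candidate test bounds from a trusted sample of the data."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return {
        "min": min(values),
        "max": max(values),
        # Mean bounds catch distribution shifts that min/max alone would miss.
        "mean_low": mean - 2 * stdev,
        "mean_high": mean + 2 * stdev,
    }

def check_against_profile(values, profile):
    """Return True if a new batch stays within the profiled bounds."""
    mean = statistics.fmean(values)
    return (min(values) >= profile["min"]
            and max(values) <= profile["max"]
            and profile["mean_low"] <= mean <= profile["mean_high"])
```

Generated checks like these belong in their own test suite with a lower alert severity, so a profiler-derived failure prompts a look rather than a page.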
Socialization: Don’t have engineers work on data tests solo. Socialize data tests within the team and do code reviews of tests whenever they’re added or modified, just like you would with actual pipeline code. This makes it easier to surface the pipeline team’s assumptions about the data and highlight any shortcomings in the tests.
Manual spot checks: Automation is great, but someone who’s operationally familiar with the data can often see when something’s “off,” even if there’s no test in place. A final step of your data quality strategy can be implementing periodic audits of your data assets, which includes manual spot-checking alongside profiler-assisted checks and, importantly, making sure that your tests are still complete, accurate, and actually running as expected.
In this post, we reviewed some key areas to consider when you’re implementing data validation in your pipelines. Developing and running tests is only one aspect of your data quality strategy, though admittedly an important one.
But just as important is what happens after those tests run: who gets alerted and how, assigning ownership of the alert response, communicating with stakeholders, doing effective root cause analysis, and actually fixing the issue. If you do these things well, they’ll take real time and effort: you need to give them just as much consideration as you do the tests themselves.