They say that good data is like oil—it’s essential to keeping your company running. If we follow that logic, then bad data is akin to a roadblock—something to avoid if you want to keep moving forward.
The problem is, keeping your company running on good data isn’t always a simple task. As data moves from one pipeline to the next, things can get tossed around (and lost) in the shuffle, lowering your overall data quality.
As companies work with larger and more complex datasets, keeping up with data quality isn’t just a matter of accuracy—it’s a matter of cost. Ultimately, data is generated by people, and people are naturally prone to error. In fact, data errors are more common than you might think. A study from Harvard Business Review found that almost half (47%) of newly created data records contain at least one critical error. Over time, these errors can wind up costing companies millions. It makes sense, then, that only 49% of data practitioners have high trust in their data, with one of the biggest barriers to that trust being the question of quality.
Now for the good news: Though you can’t exactly guarantee the prevention of all data errors, following a few key best practices can help you maintain high-quality data and avoid some nasty headaches.
What is data quality?
Let’s start with some level-setting around terms. Put simply, “data quality” describes the degree to which your data is free of errors. When you’re working with any kind of data, you’ll find that data errors typically take one of three forms:
The inputs themselves are wrong.
Example: You’re at a university that keeps a database of alumni-turned-donors, but, in a handful of cases, a donor’s graduation year is listed incorrectly.
Your understanding of the data is wrong.
Example: The donation sums listed in the database show the amount donated in the previous year only, but some folks on your data team have interpreted those sums as all-time totals.
Your understanding of the data is outdated.
Example: The addresses of some of the donors have changed, but the database hasn’t been updated to reflect that.
The reality is that threats to your data quality are always present. Most companies today work with so much data that avoiding all issues is virtually impossible, even under the best circumstances. But if you can put systems in place to avoid the three scenarios described above, you can go a long way toward maximizing data quality.
Best practices for data testing
A smart approach to testing is the best route to strong data quality. Here are some data testing best practices.
Test early and often
Without regular testing, issues may only become apparent over time—and the longer a data error goes undetected, the more problems it can cause.
We once worked with a city property assessor’s office that was wondering why the taxes assessed for one of the homes in their database were listed well into the millions of dollars—far higher than expected. As it turned out, an employee had dropped their phone on their keyboard, causing that home’s value to be listed in the billions.
An error like this can take weeks for a data practitioner to catch. That’s why we recommend putting either manual or automated processes in place to make sure issues are flagged as early as possible.
Start with the basics
There are a variety of basic tests you can implement for all of your datasets within their respective schemas. These include testing for:
Uniqueness of values
If you anticipate that every value in a column will be unique, it’s important to run a test that will immediately alert you to any duplicates.
Non-null of primary keys
If your ability to use your data depends on specific fields in the database being populated, you need to know right away if any of those fields are empty.
Foreign key relationships between tables
If keys are coded based on a system outlined in a separate table, testing confirms that every key actually matches a record in that table and makes users aware of the system, avoiding confusion.
When in doubt, we recommend running these tests every time data moves between systems.
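In plain Python, these three basic checks can be sketched as follows. The donor table, column names, and helper functions here are illustrative stand-ins, not part of any particular testing framework:

```python
# Minimal sketches of the three basic checks. A table is represented
# as a list of dicts; all names below are hypothetical examples.

def check_unique(rows, column):
    """Return duplicate values found in `column` (empty list = test passes)."""
    seen, dupes = set(), []
    for row in rows:
        value = row[column]
        if value in seen:
            dupes.append(value)
        seen.add(value)
    return dupes

def check_not_null(rows, column):
    """Return the count of missing/None values in `column`."""
    return sum(1 for row in rows if row.get(column) is None)

def check_foreign_keys(rows, column, parent_rows, parent_column):
    """Return values in `column` with no matching key in the parent table."""
    parent_keys = {row[parent_column] for row in parent_rows}
    return [row[column] for row in rows if row[column] not in parent_keys]

donors = [
    {"donor_id": 1, "grad_year": 1998},
    {"donor_id": 2, "grad_year": None},
    {"donor_id": 2, "grad_year": 2005},  # duplicate primary key
]
grad_years = [{"year": 1998}, {"year": 2005}]

print(check_unique(donors, "donor_id"))     # [2]
print(check_not_null(donors, "grad_year"))  # 1
print(check_foreign_keys(donors, "grad_year", grad_years, "year"))  # [None]
```

In practice you’d run checks like these automatically whenever data lands in a new system, rather than calling them by hand.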
Don’t get carried away
It’s critical to weigh the potential tradeoffs between the performance of your tests and the pipeline you’re trying to facilitate. When you’re knee-deep in testing, it can be easy to go overboard: you expand into areas that have no meaningful impact on your goals and end up with long-running, inefficient test suites. The more of your data you ask a test to investigate, the longer that test takes to run—if you test for more than you need, you’ll end up waiting around for no good reason.
One thing that can be helpful here is approaching testing through a severity vs. likelihood matrix. Essentially, you want to test the elements of your data that are both more likely to fail and more likely to have a big impact if they do. For example, you’d want to run tests for accidental duplication—something that would skew your data significantly.
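One way to sketch that matrix is as a simple score: rate each candidate check for severity and likelihood, multiply, and write the highest-scoring tests first. The checks and their 1–5 ratings below are made up for illustration, not a recommended ranking:

```python
# Hypothetical severity-vs-likelihood scoring for prioritizing tests.
checks = [
    {"name": "duplicate rows",    "severity": 5, "likelihood": 4},
    {"name": "null primary keys", "severity": 5, "likelihood": 2},
    {"name": "free-text typos",   "severity": 1, "likelihood": 5},
]

# Score = severity x likelihood; test the highest-scoring risks first.
ranked = sorted(checks, key=lambda c: c["severity"] * c["likelihood"], reverse=True)
for check in ranked:
    print(check["name"], check["severity"] * check["likelihood"])
# duplicate rows 20
# null primary keys 10
# free-text typos 5
```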
Dig deeper into key data
Once you’ve run basic tests, you can start to go deeper. Single out your most impactful data for more robust testing geared toward confirming that critical data points, such as the mean of your column or the distribution of values throughout it, meet your expectations.
Focus on the areas of your data that have no margin for error, like financial data. If you’re having trouble identifying which parts of your data are most critical, draw a map. Start with the data products you provide, such as users and dashboards, and highlight the most important ones. Then go upstream to identify which of your datasets they’re based on.
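A deeper check on a critical column might look like the sketch below, using only the standard library. The values and thresholds are made up for the example; in practice you’d set them from your own expectations about the data:

```python
import statistics

# Illustrative deeper checks on a critical column (values are invented).
donation_amounts = [50, 75, 100, 120, 80, 95]

# Check that the column mean falls within the expected range.
mean = statistics.mean(donation_amounts)
assert 40 <= mean <= 150, f"mean {mean} outside expected range"

# Check the distribution: no single value should dwarf the rest
# (a check like this would have caught the billion-dollar keyboard mishap).
assert max(donation_amounts) <= 10 * statistics.median(donation_amounts)
print("statistical checks passed")
```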
How to respond to testing outcomes
Let’s say you’ve got long-running pipelines in the works. You can test the outputs of intermediate steps, such as your staging models, first. If those tests pass—that is, they don’t turn up any data errors—then you can kick off the remaining pipelines. You can also test that the inputs to a long-running model are correct before it runs.
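The gating idea can be sketched in a few lines. The functions `run_staging_tests` and `run_pipeline` below are hypothetical stand-ins for your actual test suite and orchestration, not part of any specific tool:

```python
# Sketch: gate expensive downstream work on upstream test results.

def run_staging_tests(rows):
    """Return a list of failure messages; an empty list means all tests passed."""
    failures = []
    if any(row.get("id") is None for row in rows):
        failures.append("null id in staging output")
    return failures

def run_pipeline(staging_rows):
    failures = run_staging_tests(staging_rows)
    if failures:
        # Fail fast: don't start long-running downstream models on bad inputs.
        raise RuntimeError(f"staging tests failed: {failures}")
    return "downstream models started"

print(run_pipeline([{"id": 1}, {"id": 2}]))  # downstream models started
```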
So what happens if a test fails? First, remember that tests only detect issues; they don’t prevent them. The big point here, once again, is to test early and often. The further upstream you can test (and react accordingly), the more issues you can prevent down the line.
Other best practices
Now that we’ve covered testing, let's move on to best practices to prevent data issues from cropping up in the first place.
Create standard operating procedures
SOPs are a given in most business contexts, and they’re especially important for upholding data quality. Anyone who works with your data should be testing it consistently—you don’t want some people writing tests to a higher standard of thoroughness than others, for example.
Your SOPs are critical to creating visibility and accountability and should include:
Guidelines for notifying teams of database changes
Design guidelines for database tables around critical questions such as how to handle deletes or what fields are controlled input vs. free text
Rules around what constraints are implemented in the database or ORM
One thing to note: many aspects of data quality SOPs can and should be automated. For example, anytime there’s a pull request with the potential to change sensitive parts of your code, it’s a good idea to automatically copy your data team on the request, so they can review it and ensure those changes won’t break anything.
Create a data contract
At Great Expectations, we believe data contracts are the best way to foster good data quality by creating a shared understanding of what the data is for. A data contract is a means of achieving alignment between data producers and data users on what makes the data fit for its intended purpose.
Aligning on a data contract can help create systems to prevent issues before they pop up. As we mentioned before, a big part of maintaining data quality is putting those kinds of systems in place.
If you’re like most organizations, you’re probably already using data contracts in some form—even if you don’t necessarily realize it. Still, it’s a good idea to make sure your contracts are comprehensive, covering the verbal, written, and automated phases of alignment.
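At its simplest, the automated phase of a contract can be an explicit, checkable schema that producers and consumers both agree to. The field names below are illustrative, echoing the donor-database example from earlier:

```python
# A minimal data-contract sketch: the expected schema is the written
# agreement; the validate() check is its automated enforcement.
# All field names here are hypothetical.

CONTRACT = {
    "donor_id": int,
    "grad_year": int,
    "last_year_donation_usd": float,  # explicitly NOT an all-time total
}

def validate(row, contract=CONTRACT):
    """Return a list of contract violations for one record."""
    problems = []
    for field, expected_type in contract.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
    return problems

print(validate({"donor_id": 7, "grad_year": 1998, "last_year_donation_usd": 250.0}))  # []
print(validate({"donor_id": 7, "grad_year": "1998"}))
```

Even a sketch like this makes misunderstandings—such as reading last year’s donations as all-time totals—visible in code rather than leaving them to word of mouth.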
Keeping up with data quality is an ongoing task. Whether you’re a small business working with limited datasets or an enterprise handling massive data lakes, implementing standard testing and prevention protocols can help keep your data working for you—not against you.
Did we miss anything? Let us know @expectgreatdata!