
5 eye-opening data quality tests

Popular Expectations show how even simple tests capture meaningful knowledge

Nick O'Brien & Erin Kapp
July 09, 2024

Ask an analyst to summarize what their data quality process does, and they might say it reviews data to make sure nothing unusual is happening. This sounds simple enough, but ‘unusual’ is a loaded word. And to unpack it, you need more than anomaly detection and automated machine learning. You need effective data quality tests—and no tests are more effective than Expectations.

Expectations are precise and directable. Rather than monitoring pipeline health and reacting after a problem occurs, they examine the data itself, enabling you to identify root causes.

Tests evaluate your data against your own definition of what good data quality looks like, not some opaque algorithm’s. And when they fail, they create opportunities for your organization to refine that definition, building shared knowledge and driving effective collaboration.

Those opportunities—for collaboration and for defining and understanding data quality—are essential parts of an effective data quality process. While anomaly detection is a valuable tool for identifying unusual behavior in your data, it’s too blunt an instrument to capture the nuance needed to truly have confidence in your data.

Here are some examples of meaningful knowledge about your data that popular Expectations can help you capture.
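
The snippets in each section below are minimal sketches using the Great Expectations Python API. The setup here follows the fluent pandas datasource pattern from recent GX versions; the DataFrame, column names, and thresholds are all hypothetical stand-ins for your own data.

    import great_expectations as gx
    import pandas as pd

    # Hypothetical toy data standing in for one of your tables.
    df = pd.DataFrame({
        "plan_tier": ["free", "pro", "enterprise", "pro"],
        "first_name": ["Ada", "Alan", "Grace", "Alan"],
        "last_name": ["Lovelace", "Turing", "Hopper", "Kay"],
        "quantity": [3, 450, 1, 7],
        "country": ["GB", "US", "US", "US"],
    })

    # An ephemeral context and a validator wrapping the DataFrame.
    context = gx.get_context()
    validator = context.sources.pandas_default.read_dataframe(dataframe=df)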


1. Correctness: ‘Expect values to be in set’

Business requirements change. If, say, a user-facing app has new options, you’ll likely start seeing user activity data that, while historically anomalous, is perfectly acceptable. With this Expectation, you can immediately update your data quality testing to account for new-but-good data as soon as it starts happening.

This Expectation is useful even (especially) if the upstream data teams didn’t tell you they were making a change: Since you know immediately when an unexpected value appears, you can quickly reach out to confirm whether that value is actually bad.

‘Unexpected changes from upstream’ is one of the most common issues we hear about from users; this Expectation is one of the most widely used.
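
As a sketch, using the validator from the setup above (the column name and value set are hypothetical):

    # Fail fast on any value outside the currently accepted set; widen
    # the set as soon as a new option ships in the app.
    result = validator.expect_column_values_to_be_in_set(
        column="plan_tier",
        value_set=["free", "pro", "enterprise"],
    )
    print(result.success)  # True if every value is in the set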

2. Uniqueness: ‘Expect compound columns to be unique’

Often, data is most usable from a technical standpoint when it’s decomposed into specific fields. But being broken into its component parts can make your data harder to understand and evaluate.

This is especially true with identifying information. Think: separating first and last names into separate fields, or a date into a year, month, and day.

This Expectation makes it easy to use multiple columns together as an identifier of uniqueness without losing the usability that separate fields provide. (And without requiring you to add a concatenated field solely for uniqueness validation, which creates yet another column for you to monitor for quality.)

Expectations also make it easy to set the sensitivity of row-by-row checks like ‘expect values to be in set’ and ‘expect compound columns to be unique’ with their ‘mostly’ parameter. Learn more in this video.
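
Here’s a sketch combining both ideas, again with hypothetical column names; ‘mostly=0.99’ tolerates up to 1% duplicated identifier pairs before the test fails:

    # Treat the pair of name fields as one identifier, without adding a
    # concatenated helper column to the table itself.
    result = validator.expect_compound_columns_to_be_unique(
        column_list=["first_name", "last_name"],
        mostly=0.99,  # pass as long as at least 99% of rows are unique
    )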

3. Statistics: ‘Expect column min to be between’

Statistics are one of the most fundamental approaches to testing numeric or datetime data. Expectations make it easy for you to set a range for your statistical checks.

Why is this useful? Because it allows you to set guardrails against data that technically meets a simple floor-ceiling test but has distribution problems.

For example, if your dataset represents sales transactions, your natural impulse might be to set the minimum quantity purchased to 1 and the maximum to 500. But if you expect both large- and small-quantity transactions every day, you can go further: require the column minimum to fall between 1 and 10 and the column maximum to fall between 400 and 500, ensuring at least one of each kind of transaction actually occurs.
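
Sketched against the toy data above (the ‘quantity’ column and the ranges are illustrative), that distribution guardrail is a pair of checks:

    # The smallest transaction of the day should be a genuinely small
    # order, and the largest a genuinely large one.
    validator.expect_column_min_to_be_between(
        column="quantity", min_value=1, max_value=10
    )
    validator.expect_column_max_to_be_between(
        column="quantity", min_value=400, max_value=500
    )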

If a simple floor-ceiling test is what you need, this Expectation can easily do that, too.

4. Uniqueness: ‘Expect unique value count to be between’

Counting unique values is a way of evaluating distribution for non-numeric, non-datetime data. In other words, considering the possible values for this field, how many different ones do you expect to see in any given dataset? 

This Expectation helps you identify issues like:

  • A value you expect to be present in a field is not

  • A field you expect to have varied values is unexpectedly consistent

Similar to the statistical Expectations, this Expectation helps you create guardrails against unexpected distributions of valid values.
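
A sketch with the hypothetical ‘country’ column; in practice, the bounds would come from your own knowledge of the field:

    # Expect healthy variety: more than one country, but not an
    # implausible explosion of new codes.
    result = validator.expect_column_unique_value_count_to_be_between(
        column="country",
        min_value=2,
        max_value=10,
    )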

5. Volume: ‘Expect table row count to equal other table’

Transformations and migrations are precarious moments in a dataset’s life.

One of the most fundamental ways to check whether your data made it safely through one of these transformations is to answer a simple question: Is there the same amount of data as before? This Expectation makes it easy to compare data volumes across tables and even data sources.

Migration and transformation checking are common use cases, but that’s not all this Expectation can do. There are plenty of other reasons you might expect two different tables to have the same number of rows.
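
Because this Expectation compares two tables in the same datasource, it runs against a SQL-backed validator rather than the pandas one above; the table name here is a hypothetical stand-in:

    # Against a SQL-backed validator: did every row survive the
    # migration from the source table?
    result = validator.expect_table_row_count_to_equal_other_table(
        other_table_name="orders_raw",
    )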


Even data that looks business-as-usual can be problematic. That’s just one major reason why a complete data quality process needs something more nuanced on top of automated approaches like anomaly detection.

Expectations—Great Expectations’ data quality tests—help you build a collective repository of knowledge around your data. They tell you not just whether your data looks like it always does but whether it’s meeting your needs. And when they fail, they don’t just alert you—they start conversations that refine your organization’s concepts of good data and get everyone on the same page.

To see more Expectations and explore how you can combine them for even deeper insights, check out GX Cloud.


