
5 more eye-opening data quality tests

Expressive tests help you capture unmatched depth of knowledge

GX team
November 21, 2024

To take meaningful action, you need meaningful information.

In other words, it’s really hard—and sometimes impossible—to fix a problem if you don’t know what the desired outcome is. If there’s a good argument to the contrary, we’ve yet to hear it.

Your data’s quality is no exception. 

When you have an unexpected data quality failure, you don’t just need to know that there’s a problem. You need to know what the data should look like, and why it should look like that. And that why should be informed by all of the data’s stakeholders, not just one or two—or even worse, by an algorithm and not any stakeholders at all.

But good news: this high-information environment is completely achievable when you start using expressive, explicit, and extensible data quality tests. And the most expressive, explicit, and extensible data quality tests of all? 

Expectations, of course.

Here are 5 more eye-opening data quality tests that take the expressivity up a notch, and show how specific data quality tests can help you define nuance in your data and produce high-context, actionable alerts. 

1. Distribution: Expect column value z-scores to be less than

Plenty of data should be normally distributed… and plenty more shouldn’t be. Testing your numeric data’s z-scores can help make sure your column’s data has the right shape.

If you are expecting a bell curve (or bell-curve-ish) distribution, the Expectation “Expect column value z-scores to be less than” is what you need: just set a z-score threshold and make strategic use of the “mostly” parameter. Use the Expectation’s “double-sided” feature to compare the absolute value of your z-scores to the threshold and cover both ends of the distribution at once.

To check for a less-dense-than-normal distribution, you can set your “mostly” parameter lower to allow a larger proportion of values beyond the z-score threshold; for a denser-than-normal distribution, raise “mostly” to permit fewer values to exceed your threshold.

If you aren’t expecting a bell-curve-like distribution, “Expect column value z-scores to be less than” can still help! Uncheck the “double-sided” parameter (or set it to `False` if you're using GX Core) and you can configure your checks for the two ends of your distribution differently.

Need to skip 0s or some other placeholder number? Filter out these undesirable rows using an Expectation condition.
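Here’s a minimal GX Core sketch of this kind of check. It follows the GX Core quickstart pattern for an in-memory pandas Batch; the dataframe, the `response_time_ms` column, and the threshold and “mostly” values are hypothetical, and the row condition shown is one way to skip placeholder zeros (exact setup calls can vary by GX Core version).

```python
import great_expectations as gx
import great_expectations.expectations as gxe
import pandas as pd

# Hypothetical data with a placeholder 0 we want to exclude from the check.
df = pd.DataFrame({"response_time_ms": [120, 135, 128, 142, 118, 0, 131, 125]})

# Quickstart-style setup for an in-memory pandas Batch.
context = gx.get_context()
data_source = context.data_sources.add_pandas(name="pandas")
data_asset = data_source.add_dataframe_asset(name="latencies")
batch_definition = data_asset.add_batch_definition_whole_dataframe("all_rows")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# Double-sided: compare |z-score| to the threshold, covering both tails at once.
z_score_check = gxe.ExpectColumnValueZScoresToBeLessThan(
    column="response_time_ms",
    threshold=3,
    double_sided=True,
    mostly=0.99,
)

# The same check with a row condition that skips the placeholder zeros.
z_score_check_nonzero = gxe.ExpectColumnValueZScoresToBeLessThan(
    column="response_time_ms",
    threshold=3,
    double_sided=True,
    mostly=0.99,
    row_condition="response_time_ms != 0",  # pandas query syntax
    condition_parser="pandas",
)

print(batch.validate(z_score_check).success)
print(batch.validate(z_score_check_nonzero).success)
```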

2. Uniqueness: Expect column proportion of unique values to be between

Value uniqueness is an important way of evaluating distribution for non-numeric data. The “Expect column proportion of unique values to be between” Expectation helps you identify situations where your data is too unique (or not unique enough) given your expected data volume. 

Why do we set the threshold “to be between” rather than a simple “to be”? Because setting a threshold as a range rather than a single number is a way to guard against some distribution problems. 

For example: if you expect your column to be 20-30% non-unique values, you might set a threshold of at least 70% unique values. But if a batch of data comes in with 99% unique values, that technically meets your threshold just fine… but you also have many fewer non-unique values than you expected.

A flat threshold wouldn’t catch this second situation. But if you use “Expect column proportion of unique values to be between” to say that you expect 70-80% unique values, you can catch both data that’s not unique enough and data that’s too unique in a single test.

If a single-sided threshold really will do the job, this Expectation (like all “...to be between” Expectations) supports that too: just leave one end of the range unset.
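As a rough sketch of what both forms look like in GX Core (assuming a Batch named `batch`, obtained as in the z-score sketch above, and a hypothetical `customer_email` column):

```python
import great_expectations.expectations as gxe

# Flag batches that are either not unique enough or suspiciously over-unique.
uniqueness_check = gxe.ExpectColumnProportionOfUniqueValuesToBeBetween(
    column="customer_email",
    min_value=0.7,
    max_value=0.8,
)

# One-sided variant: leave max_value unset to enforce only a floor.
floor_only_check = gxe.ExpectColumnProportionOfUniqueValuesToBeBetween(
    column="customer_email",
    min_value=0.7,
)

result = batch.validate(uniqueness_check)  # validate against a Batch containing this column
```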

3. Schema: Expect column values to be of type (list)

The data horror story about a team that resolved any and all data quality errors by converting the problem column to `varchar` haunts us to this day.

Data quality testing on your column’s value type(s) can help you avoid living your own version of this terrifying tale. As that story perfectly illustrates, you can’t necessarily rely on column typing in the warehouse for type enforcement. But with “Expect column values to be of type,” you’ll know as soon as the first variant piece of data enters the column.

And, of course, if your data isn’t coming from a warehouse that enforces typing, “Expect column values to be of type” can help you do that enforcement.

Maybe you have type enforcement built into your warehouse, plus total control and insight into changes to your warehouse. It’s still worth considering what a close cousin Expectation—“Expect column values to be of type list”—can tell you for columns that do legitimately need to be `varchar`.

Specifically: if your `varchar` column needs to accommodate several types, but not all types, “Expect column values to be of type list” can help you take action if (when) those other types sneak in.

In this Expectation, specify your data types in the way that the data source you’re evaluating expects and allows. That means no TINYINT if you’re testing PostgreSQL data, for example.
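Below is a sketch of both Expectations in GX Core (the list version appears there as `ExpectColumnValuesToBeInTypeList`). The column names and type strings are hypothetical; use whatever type names your backend actually reports, as described above.

```python
import great_expectations.expectations as gxe

# One expected type for the whole column (use your backend's type names;
# e.g. "NUMERIC" for PostgreSQL, "float64" for a pandas dataframe).
type_check = gxe.ExpectColumnValuesToBeOfType(
    column="order_total",
    type_="float64",
)

# A looser check for a column that legitimately mixes a few types, but not all:
# here, Python ints and strs in a pandas object column.
type_list_check = gxe.ExpectColumnValuesToBeInTypeList(
    column="reference_code",
    type_list=["int", "str"],
)

result = batch.validate(type_list_check)  # validate against a Batch containing this column
```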

4. Numeric: Expect multicolumn sum to equal

To quote ourselves, “often, data is most usable from a technical standpoint when it’s decomposed into specific fields.” And while we originally said this about string data, it holds true for numeric too.

The simplest example? Percentages split across columns in proportions that will differ row-to-row, but should always equal 100%.

This Expectation lets you make sure that all the numbers literally add up while preserving the usability of multiple columns. 

Is “every row” a little too strict? The “mostly” parameter, “ignore row if” filter, and the equality filter let you apply this Expectation to a subset of the rows in your data.
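For the percentages example, a sketch might look like the following; the column names are hypothetical, and the “mostly” value is just an illustration of loosening the row-level requirement.

```python
import great_expectations.expectations as gxe

# Each row splits traffic into three percentage columns that should sum to 100.
sum_check = gxe.ExpectMulticolumnSumToEqual(
    column_list=["organic_pct", "paid_pct", "referral_pct"],
    sum_total=100,
    mostly=0.95,  # tolerate a small fraction of rows that don't add up
)

result = batch.validate(sum_check)  # validate against a Batch containing these columns
```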

5. Validity: Expect column values to match like pattern (list)

Another important tool in your toolbox when it comes to validating your data is pattern matching. 

It’s especially useful in scenarios where data can originate from a variety of sources, and therefore might contain multiple ways of formatting the same data. Think phone numbers: (999) 999-9999, 999-999-9999, 9999999999, and the list goes on.

Pattern matching can be used to validate that only allowed formats have been used. But particularly when you’re standardizing data, pattern matching can also be used to check that the output of a standardization transformation is what you expect, and that you haven’t failed to address any formats (999.999.9999). Or, even harder to detect without pattern matching, that a typo hasn’t crept into a transformation and started producing unexpected results.

It’s easy to do pattern matching using the “Expect column values to match like pattern” and “Expect column values to match like pattern list” Expectations. Use the “...list” option when you have multiple valid patterns.

The “like” part of the Expectation name refers to the SQL `LIKE` operator. You define the pattern(s) that you want to match in these Expectations as if you were creating a pattern for `LIKE` in a SQL statement.
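Here’s a sketch of the list version for the phone-number example. These like-pattern Expectations run against SQL data sources (they translate to the `LIKE` operator), the column name is hypothetical, and note that `LIKE`’s `_` wildcard matches any single character, not just digits.

```python
import great_expectations.expectations as gxe

# Accept the three formats discussed above: (999) 999-9999, 999-999-9999,
# and 9999999999. LIKE's "_" matches any single character.
phone_format_check = gxe.ExpectColumnValuesToMatchLikePatternList(
    column="contact_phone",
    like_pattern_list=[
        "(___) ___-____",
        "___-___-____",
        "__________",
    ],
)

result = batch.validate(phone_format_check)  # requires a SQL-backed Batch
```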

When using these Expectations to check the outputs of transformations, consider also deploying uniqueness checks to further guard against transformations that accidentally make too much (or not enough) change.


Knowing that you have data quality issues isn’t enough. You and your team have limited time and limited resources: you need to be able to triage what’s going on and then act effectively.

Data anomaly detection tests can give you low-information alerts, but those don’t help you with triage or your response. Only expressive data quality tests can encode the specific knowledge you need to take appropriate action.

Expectations are the most expressive way to do data quality testing, and they’re designed to be accessible to all stakeholders to maximize the effectiveness of your collaboration while creating, maintaining, and responding to them.

Want to see Expectations with more context than a top-5 list? Our data quality use case tutorials show the key Expectations for addressing different types of quality issues.

GX Cloud is the easiest way to start using Expectations. It’s a fully-hosted SaaS platform—you can be validating your Snowflake, Databricks, and Postgres data within minutes of signing up. 

Or learn more about GX Core, our open source Python framework (and the most popular data quality framework in the world).


