
The 10 most popular Expectations

What aspects of data quality GX users look at the most

Erin Kapp
September 14, 2023

Great Expectations has more than 300 ready-to-use Expectations, not to mention all of the Custom Expectations users have implemented in their own deployments.

And out of all those, here are the GX community’s top data quality checks: the 10 most popular Expectations.

(How did we determine popularity? See the Methodology section at the end of this post.)

#1 Expect_column_values_to_not_be_null

We’ve talked about how missing data is one of the biggest data quality issues you can face, so it’s no surprise that expect_column_values_to_not_be_null was the most popular Expectation.

This core Expectation doesn’t need additional parameters to deploy: just apply it to your column and you’re ready to go. 

Note that this Expectation only counts values that are explicitly missing, like a NULL in PostgreSQL or np.nan in Pandas. Empty strings won’t be counted as a null unless they’ve previously been coerced into a null type.
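To make the distinction concrete, here is a minimal plain-Python sketch. It is not GX code: the helper name is_missing is hypothetical, and it only illustrates the idea that explicit nulls count as missing while empty strings do not.

```python
import math

def is_missing(value):
    """Return True only for explicitly missing values: None or a float NaN."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)

# Hypothetical column with two explicit nulls and one empty string
column = ["a", None, "", float("nan"), "b"]

missing = [v for v in column if is_missing(v)]
print(len(missing))  # the empty string is not included in this count
```

If empty strings should count as missing in your data, convert them to a null type before validating.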

#2 Expect_column_values_to_be_between

Describing the range of your data is another basic check, particularly for numerical or date/time data.

The allowed range is defined inclusively for this Expectation by default; you can use the strict_min and strict_max flags to change either (or both) bounds to be exclusive. Expect_column_values_to_be_between has been implemented for all supported backends except MySQL, so if you’re looking to brush up on your Python+MySQL skills this could be a great project to take on!
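Here is a rough sketch of how inclusive and exclusive bounds behave. This is plain Python, not the GX implementation: the function value_is_between is hypothetical, and its parameter names simply mirror the ones described above.

```python
def value_is_between(value, min_value=None, max_value=None,
                     strict_min=False, strict_max=False):
    """Check one value against optional bounds, inclusive unless a strict flag is set."""
    if min_value is not None:
        # strict_min=True turns the lower bound into an exclusive one
        if value < min_value or (strict_min and value == min_value):
            return False
    if max_value is not None:
        # strict_max=True turns the upper bound into an exclusive one
        if value > max_value or (strict_max and value == max_value):
            return False
    return True

print(value_is_between(10, min_value=10, max_value=20))
print(value_is_between(10, min_value=10, max_value=20, strict_min=True))
```

With the default flags, a value sitting exactly on a bound passes; with the matching strict flag set, it fails.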

#3 Expect_column_values_to_be_in_set

For when your allowed values are a distinct and definable group, but not a range.

In this Expectation, you supply the allowed values as a set-like group of objects—'set-like' meaning any iterable group of values, including sets, lists, and tuples.

#4 Expect_column_values_to_be_unique

The name says it all for this Expectation. If you’re using expect_column_values_to_be_unique, it’s a good time to think about whether the mostly parameter is relevant: 100% uniqueness can be a tall order, so it’s good to consider whether, say, 95% uniqueness is good enough.
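As an illustration of what a threshold like mostly does, here is a plain-Python sketch. It is not the GX implementation: the helper names are hypothetical, and it assumes that every occurrence of a duplicated value counts against uniqueness.

```python
from collections import Counter

def fraction_unique(values):
    """Fraction of values that appear exactly once in the column."""
    if not values:
        return 1.0
    counts = Counter(values)
    unique = sum(1 for v in values if counts[v] == 1)
    return unique / len(values)

def passes_uniqueness(values, mostly=1.0):
    """Pass when at least `mostly` of the values are unique."""
    return fraction_unique(values) >= mostly

# Hypothetical ID column: 98 unique values plus one duplicated pair
ids = list(range(98)) + [500, 500]

print(fraction_unique(ids))
print(passes_uniqueness(ids))               # strict: every value must be unique
print(passes_uniqueness(ids, mostly=0.95))  # tolerant: 95% is good enough
```

The same column fails the strict check and passes the tolerant one, which is exactly the tradeoff to think through before picking a threshold.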

This Expectation is also still waiting for MySQL support: if that piques your interest, it’s easy to get started with contributing.

#5 Expect_table_row_count_to_be_between

For the first time in this list, there’s a table-level Expectation! It doesn’t look at the individual records within your data, but at your batch of data as a whole. 

A prerequisite for using this Expectation is, of course, some sense of how many rows you expect to get in a given batch—or rather, in your Batch.

Keep in mind that the GX Batch is a much more flexible concept than the generic 'batch': you can create Batches based on dates, times, unique IDs, ZIP codes—basically any characteristic that’s described by your data. Batches are key for contextualizing your data quality.

#6 Expect_table_columns_to_match_set

Another table-level Expectation, and a key one if you don’t have direct control of your upstream data: this Expectation identifies whether the columns you think should be there are, in fact, there and named what you think they should be. 

Whether a column has disappeared entirely or just been renamed, you want to be able to make sure there aren’t going to be downstream ramifications. 

Unexpected columns existing might or might not be a big deal—it’s worth monitoring them because if a column was simply renamed, you could find that out from the missing old name + unexpected new name combination. 

(Obviously, confirm with your upstream colleagues, too: communication is essential to effective data quality.)

This Expectation’s default behavior is that there must be a column matching every name in the set and no column names that aren’t included in the set. You can set exact_match to False to allow column names that aren’t described in the Expectation to pass.
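The comparison can be sketched with ordinary set operations. This is plain Python rather than GX code: compare_columns and the column names are hypothetical, and the sketch assumes that a missing expected column fails the check in both modes.

```python
def compare_columns(actual_columns, expected_columns, exact_match=True):
    """Compare column names as sets, ignoring order."""
    actual, expected = set(actual_columns), set(expected_columns)
    missing = expected - actual      # expected names that are absent
    unexpected = actual - expected   # names present but not expected
    success = not missing and (not exact_match or not unexpected)
    return {
        "success": success,
        "missing": sorted(missing),
        "unexpected": sorted(unexpected),
    }

expected = {"id", "email", "signup_date"}

# An extra column passes only when exact_match is False
extra = ["id", "email", "signup_date", "notes"]
print(compare_columns(extra, expected))
print(compare_columns(extra, expected, exact_match=False))

# A rename shows up as one missing name plus one unexpected name
renamed = ["id", "email_address", "signup_date"]
print(compare_columns(renamed, expected, exact_match=False))
```

The last call shows the rename pattern: one missing name paired with one unexpected name.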

This Expectation also ignores column order: if you also want to check that your columns are in a specific order, try expect_table_columns_to_match_ordered_list instead.

#7 Expect_column_min_to_be_between

Expect_column_min_to_be_between lets you specify a range for the minimum value of your column. If you want, you can set both ends of the range to the same value, creating a fixed minimum.

The default behavior for this Expectation is for the range to be inclusive. If you want to define this Expectation as an exclusive or partially exclusive range, set strict_min and/or strict_max to True as needed.

Why would you set a range rather than a single minimum? 

Because a single minimum bound means that the Expectation will still pass even if the actual minimum is much higher than that bound, and there are many scenarios where you’d want to detect that.

Here are several examples of scenarios where a minimum range detects something more meaningful than a simple minimum would.

#8 Expect_column_max_to_be_between

One way to describe this Expectation is “the previous one, but going the other way.” But to summarize why you would describe a maximum as a range rather than a single number:

By setting a single maximum value, you can’t detect scenarios where the actual maximum is much lower than you expect. 

For example, a max of 100 won’t fail when the actual max is 20. If having an actual maximum of 20 is all right, then that’s fine. But if you expect your maximum to be at least 80, setting a range for your maximum is the only way to check that the top end of your values is as you expect.
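The numbers in that example can be checked with a small plain-Python sketch. It is not GX code: column_max_is_between is a hypothetical helper that uses inclusive bounds, and the scores list is made up for illustration.

```python
def column_max_is_between(values, min_value=None, max_value=None):
    """Check that the observed maximum falls inside an inclusive range."""
    observed = max(values)
    if min_value is not None and observed < min_value:
        return False
    if max_value is not None and observed > max_value:
        return False
    return True

# Hypothetical column whose actual maximum is 20
scores = [3, 12, 20]

print(column_max_is_between(scores, max_value=100))                # upper bound only
print(column_max_is_between(scores, min_value=80, max_value=100))  # range for the max
```

With only the upper bound, a maximum of 20 passes; with the range of 80 to 100, it fails, which is the signal you were looking for.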

#9 Expect_column_values_to_be_of_type

If your backend supports data typing by column, you can use this Expectation! You can also use it with Pandas if your column dtype and provided type_ are unambiguous constraints.

(Pandas users: that means this Expectation will work with any dtype other than object, or if both dtype and type_ are object. However, because Pandas column typing is relatively weak, this Expectation will independently check each row’s type, which could have significant performance impacts.)

Your backend determines exactly how you represent the type in the arguments for this Expectation.

#10 Expect_column_values_to_match_regex

To close out this list, we have expect_column_values_to_match_regex. 

Famous quips aside, there are plenty of situations where you can do useful data validation using regular expressions. 

This Expectation is how you can easily implement that validation. It isn’t supported on several backends yet, so if you love regex, this is a great opportunity to make it more widely available in GX.
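For a feel of what this kind of check does, here is a plain-Python sketch using the standard re module. It is not GX code: fraction_matching is a hypothetical helper, the ZIP code values are made up, and the use of search semantics (a match anywhere in the string unless the pattern is anchored) is an assumption.

```python
import re

def fraction_matching(values, pattern):
    """Fraction of values in which the pattern is found."""
    if not values:
        return 1.0
    compiled = re.compile(pattern)
    matches = sum(1 for v in values if compiled.search(str(v)))
    return matches / len(values)

# Hypothetical column of ZIP codes; the anchors require exactly five digits
zip_codes = ["30301", "9021", "10001-1234", "60614"]
five_digits = r"^\d{5}$"

print(fraction_matching(zip_codes, five_digits))
```

Anchoring the pattern with ^ and $ matters here: without the anchors, the nine-digit value would also count as a match.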

Methodology: what is popularity?

Figuring out which Expectations are most popular seems like it should be simple, but—as is all too common in data quality—once you start thinking about it, there are more layers.

Is an Expectation’s popularity based on the number of deployments it’s used in? The number of Data Contexts that use it as a test? The number of times it’s created? Do you weight the results against Expectations that are featured in tutorials and documentation, because those might artificially inflate their numbers?

Rather than pick just one of these, for this blog we ranked Expectation popularity several ways with our anonymous usage statistics, across both GX Open Source and the Cloud Beta. This gave us five top-10 lists. We then counted how many lists each Expectation appeared on, and used that as the final ranking order.
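The counting step can be sketched in a few lines of Python. Everything in this example is a placeholder: the lists are shortened, the Expectation names are hypothetical, and breaking ties alphabetically is an assumption, since this post doesn't say how ties were handled.

```python
from collections import Counter

# Hypothetical, shortened lists standing in for the five top-10 lists
top_lists = [
    ["expectation_a", "expectation_b", "expectation_c"],
    ["expectation_a", "expectation_c", "expectation_d"],
    ["expectation_a", "expectation_b", "expectation_d"],
    ["expectation_a", "expectation_b", "expectation_c"],
    ["expectation_b", "expectation_a", "expectation_e"],
]

# Count how many lists each Expectation appears on (at most once per list)
appearances = Counter(name for ranking in top_lists for name in set(ranking))

# Order by number of appearances, breaking ties alphabetically (an assumption)
final_ranking = sorted(appearances, key=lambda name: (-appearances[name], name))

for name in final_ranking:
    print(name, appearances[name])
```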
