Great Expectations has more than 300 Expectations ready to use, not to mention all of the Custom Expectations users have implemented in their own deployments.
And out of all those, here are the GX community’s top data quality checks: the 10 most popular Expectations.
(How did we determine popularity? See the Methodology section at the end of this post.)
#1 Expect_column_values_to_not_be_null
We’ve talked about how missing data is one of the biggest data quality issues you can face, so it’s no surprise that expect_column_values_to_not_be_null was the most popular Expectation.
This core Expectation doesn’t need additional parameters to deploy: just apply it to your column and you’re ready to go.
Note that this Expectation only counts values that are explicitly missing, like a NULL in PostgreSQL or a NaN in Pandas. Empty strings won’t be counted as null unless they’ve previously been coerced into a null type.
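To make the distinction concrete, here is a minimal plain-Python sketch of the idea. It is not GX's implementation: the helper names is_missing and count_missing are hypothetical, and the sketch assumes that only None and float NaN count as explicit missing markers.

```python
import math


def is_missing(value):
    """True only for explicit missing markers: None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(values):
    """Count the explicitly missing entries in a column of values."""
    return sum(1 for v in values if is_missing(v))


# Illustrative column: one empty string, one None, one NaN.
column = ["alice", "", None, float("nan"), "bob"]
print(count_missing(column))  # prints 2: the empty string is not counted
```

If empty strings should count as missing in your data, convert them to a null type before validating.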
#2 Expect_column_values_to_be_between
Describing the range of your data is another basic check, particularly for numerical or date/time data.
The allowed range is defined inclusively for this Expectation by default; you can use the strict_min and strict_max flags to change either (or both) bounds to be exclusive. Expect_column_values_to_be_between has been implemented for all supported backends except MySQL, so if you’re looking to brush up on your Python+MySQL skills, this could be a great project to take on!
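The difference between inclusive and exclusive bounds is easiest to see at the edges of the range. The sketch below is plain Python rather than GX code: value_in_range is a hypothetical helper whose keyword names are modeled on GX's strict_min and strict_max parameters.

```python
def value_in_range(value, min_value, max_value, strict_min=False, strict_max=False):
    """Inclusive range check by default; each bound can be made exclusive."""
    above_min = value > min_value if strict_min else value >= min_value
    below_max = value < max_value if strict_max else value <= max_value
    return above_min and below_max


# A value sitting exactly on the upper bound passes only when that bound is inclusive.
on_upper_inclusive = value_in_range(10, 0, 10)
on_upper_exclusive = value_in_range(10, 0, 10, strict_max=True)
```

Here on_upper_inclusive is True and on_upper_exclusive is False, which is the behavior change the two flags control.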
#3 Expect_column_values_to_be_in_set
For when your allowed values are a distinct and definable group, but not a range.
In this Expectation, you supply the allowed values as a set-like group of objects—'set-like' meaning any iterable group of values, including sets, lists, and tuples.
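As a rough illustration of what 'set-like' means in practice, this plain-Python sketch accepts a list, tuple, or set of allowed values and reports whatever falls outside it. The helper name values_not_in_set is hypothetical, and the sketch assumes the allowed values are hashable.

```python
def values_not_in_set(values, value_set):
    """Return the values that are not among the allowed ones."""
    allowed = set(value_set)  # any iterable of hashable values works here
    return [v for v in values if v not in allowed]


statuses = ["open", "closed", "pending", "open"]

# The same allowed values, supplied as a list, a tuple, and a set.
from_list = values_not_in_set(statuses, ["open", "closed"])
from_tuple = values_not_in_set(statuses, ("open", "closed"))
from_set = values_not_in_set(statuses, {"open", "closed"})
```

All three calls return the same result, because only membership matters and not the container type.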
#4 Expect_column_values_to_be_unique
The name says it all for this Expectation. If you’re using expect_column_values_to_be_unique, it’s a good time to think about whether the mostly parameter is relevant: 100% uniqueness can be a tall order, so it’s good to consider whether, say, 95% uniqueness is good enough.
This Expectation is also still waiting for MySQL support: if that piques your interest, it’s easy to get started with contributing.
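Here is a small plain-Python sketch of how a uniqueness threshold changes the outcome. It is not GX's implementation: the helper names are hypothetical, the mostly keyword is modeled on the GX parameter of the same name, and the sketch assumes that every copy of a repeated value counts as non-unique.

```python
from collections import Counter


def unique_fraction(values):
    """Fraction of entries whose value appears exactly once in the column."""
    if not values:
        return 1.0
    counts = Counter(values)
    unique = sum(1 for v in values if counts[v] == 1)
    return unique / len(values)


def passes_uniqueness(values, mostly=1.0):
    """Pass when at least the fraction `mostly` of entries are unique."""
    return unique_fraction(values) >= mostly


# Illustrative IDs: 98 distinct values plus one value that appears twice.
ids = list(range(98)) + [500, 500]
strict_result = passes_uniqueness(ids)
relaxed_result = passes_uniqueness(ids, mostly=0.95)
```

With these illustrative IDs, 98% of entries are unique, so the strict check fails while the 95% threshold passes.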
#5 Expect_table_row_count_to_be_between
For the first time in this list, there’s a table-level Expectation! It doesn’t look at the individual records within your data, but at your batch of data as a whole.
A prerequisite for using this Expectation is, of course, some sense of how many rows you expect to get in a given batch—or rather, in your Batch.
Keep in mind that the GX Batch is a much more flexible concept than the generic 'batch': you can create Batches based on dates, times, unique IDs, ZIP codes—basically any characteristic that’s described by your data. Batches are key for contextualizing your data quality.
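To illustrate the idea of checking row counts per Batch, the plain-Python sketch below groups rows by a characteristic in the data and checks each group's size against a range. It does not use GX's Batch machinery: the helper names, the order_date field, and the sample rows are all made up for illustration.

```python
from collections import defaultdict


def row_counts_by_batch(rows, key):
    """Group rows by the value of `key` and count the rows in each group."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)


def row_count_ok(count, min_value, max_value):
    """Inclusive check that a batch's row count falls in the expected range."""
    return min_value <= count <= max_value


# Made-up rows, batched by date.
rows = [
    {"order_date": "2024-01-01", "amount": 10},
    {"order_date": "2024-01-01", "amount": 25},
    {"order_date": "2024-01-01", "amount": 7},
    {"order_date": "2024-01-02", "amount": 12},
]

counts = row_counts_by_batch(rows, "order_date")
results = {batch: row_count_ok(n, 2, 5) for batch, n in counts.items()}
```

In this sketch the first date passes with three rows and the second fails with one, which is the kind of per-batch signal a table-level row count check gives you.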
#6 Expect_table_columns_to_match_set
Another table-level Expectation, and a key one if you don’t have direct control of your upstream data: this Expectation identifies whether the columns you think should be there are, in fact, there and named what you think they should be.
Whether a column has disappeared entirely or just been renamed, you want to be able to make sure there aren’t going to be downstream ramifications.
Unexpected columns might or might not be a big deal, but they’re worth monitoring: if a column was simply renamed, the combination of a missing old name and an unexpected new name will reveal it.
(Obviously, confirm with your upstream colleagues, too: communication is essential to effective data quality.)
This Expectation’s default behavior is that there must be a column matching every name in the set and no column names that aren’t included in the set. You can include exact_match, set to False, to allow column names that aren’t described in the Expectation to pass.
This Expectation also ignores column order: if you also want to check that your columns are in a specific order, try expect_table_columns_to_match_ordered_list instead.
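The renamed-column scenario is easy to sketch with set differences. The code below is plain Python, not GX's implementation: compare_columns is a hypothetical helper, its exact_match keyword is modeled on the GX parameter, and the column names are made up.

```python
def compare_columns(actual_columns, column_set, exact_match=True):
    """Compare actual column names to an expected set, ignoring order."""
    actual, expected = set(actual_columns), set(column_set)
    missing = sorted(expected - actual)
    unexpected = sorted(actual - expected)
    # Missing columns always fail; unexpected ones fail only on an exact match.
    success = not missing and (not exact_match or not unexpected)
    return {"success": success, "missing": missing, "unexpected": unexpected}


expected = {"id", "name", "total"}

# A rename upstream shows up as one missing name plus one unexpected name.
renamed = compare_columns(["id", "customer_name", "total"], expected)

# An extra column passes once exact matching is relaxed.
extra = compare_columns(["total", "notes", "id", "name"], expected, exact_match=False)
```

In the renamed case, the result lists name as missing and customer_name as unexpected, which is the pairing worth raising with your upstream colleagues.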
#7 Expect_column_min_to_be_between
Expect_column_min_to_be_between lets you specify a range for the minimum value of your column. If you want, you can set both ends of the range to the same value, creating a fixed minimum.
The default behavior for this Expectation is for the range to be inclusive. If you want to define this Expectation as an exclusive or partially exclusive range, set strict_min and/or strict_max to True as needed.
Why would you set a range rather than a single minimum?
Because a single minimum means that the Expectation will still pass even if the actual minimum is much higher than the minimum bound, and there are many scenarios where you’d want to detect that.
#8 Expect_column_max_to_be_between
One way to describe this Expectation is “the previous one, but going the other way.” But to summarize why you’d describe a maximum as a range rather than a single number:
By setting a single maximum value, you can’t detect scenarios where the actual maximum is much lower than you expect.
For example, a max of 100 won’t fail when the actual max is 20. If having an actual maximum of 20 is all right, then that’s fine. But if you expect your maximum to be at least 80, setting a range for your maximum is the only way to check that the top end of your values is as you expect.
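The 100-versus-20 example can be sketched in a few lines of plain Python. This is an illustration of the reasoning, not GX code: aggregate_in_range is a hypothetical helper and the scores are made up.

```python
def aggregate_in_range(values, aggregate, min_value, max_value):
    """Check that aggregate(values) falls within [min_value, max_value]."""
    observed = aggregate(values)
    return min_value <= observed <= max_value


scores = [3, 12, 20]

# A single upper bound cannot notice that the maximum is unexpectedly low.
passes_single_bound = max(scores) <= 100

# A range for the maximum fails here, because 20 is below 80.
passes_range = aggregate_in_range(scores, max, 80, 100)

# The same idea applies to the minimum: 3 is above the allowed 0 to 1 range.
min_passes_range = aggregate_in_range(scores, min, 0, 1)
```

The single-bound check passes while both ranged checks fail, which is exactly the gap that a range closes for both the minimum and the maximum.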
#9 Expect_column_values_to_be_of_type
If your backend supports data typing by column, you can use this Expectation! You can also use it with Pandas if your column dtype and provided type_ are unambiguous constraints.
(Pandas users: that means this Expectation will work with any dtype other than object, or if both dtype and type_ are object. However, because Pandas column typing is relatively weak, this Expectation will independently check each row’s type, which could have significant performance impacts.)
Your backend determines exactly how you represent the type in the arguments for this Expectation.
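To show why a row-by-row type check costs more than reading a single column-level type, here is a plain-Python sketch that inspects every value in turn. It is not how GX implements the check: rows_not_of_type is a hypothetical helper and the sample column is made up.

```python
def rows_not_of_type(values, expected_type):
    """Return (index, value) pairs whose Python type is not expected_type."""
    return [(i, v) for i, v in enumerate(values) if not isinstance(v, expected_type)]


# A loosely typed column: mostly strings, with one integer mixed in.
mixed = ["10", "20", 30, "40"]
bad_rows = rows_not_of_type(mixed, str)
```

Because every value has to be examined, the work grows with the number of rows, which is where the performance impact on large loosely typed columns comes from.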
#10 Expect_column_values_to_match_regex
To close out this list, we have expect_column_values_to_match_regex.
Famous quips aside, there are plenty of situations where you can do useful data validation using regular expressions.
This Expectation is how you can easily implement that validation. It isn’t supported on several backends yet, so if you love regex, this is a great opportunity to make it more widely available in GX.
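As a quick illustration of regex-based validation, the sketch below uses Python's standard re module to list the values that fail a pattern. The helper name unmatched_values, the pattern, and the sample codes are made up. The sketch uses re.search, so a pattern matches anywhere in the value unless you anchor it, and whether a given GX backend behaves the same way is an assumption to verify.

```python
import re


def unmatched_values(values, regex):
    """Return the values in which the pattern is not found."""
    pattern = re.compile(regex)
    return [v for v in values if pattern.search(v) is None]


# Illustrative product codes: two capital letters, a dash, three digits.
codes = ["AB-123", "ab-123", "AB-12", "CD-456"]
bad_codes = unmatched_values(codes, r"^[A-Z]{2}-\d{3}$")
```

Anchoring the pattern with ^ and $ makes the check apply to the whole value rather than to any substring.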
Methodology: what is popularity?
Figuring out which Expectations are most popular seems like it should be simple, but—as is all too common in data quality—once you start thinking about it, there are more layers.
Is an Expectation’s popularity based on the number of deployments it’s used in? The number of Data Contexts that use it as a test? The number of times it’s created? Do you weight the results against Expectations that are featured in tutorials and documentation, because those might artificially inflate their numbers?
Rather than pick just one of these, for this blog we ranked Expectation popularity several ways with our anonymous usage statistics, across both GX Open Source and the Cloud Beta. This gave us five top-10 lists. We then counted how many lists each Expectation appeared on, and used that as the final ranking order.
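The counting step can be sketched in a few lines of Python. The lists below are made-up placeholders, not our actual usage data, and rank_by_list_appearances is a hypothetical helper. Breaking ties alphabetically is also an assumption added for the sketch.

```python
from collections import Counter


def rank_by_list_appearances(top_lists):
    """Rank names by how many of the top lists they appear on."""
    # set() makes sure a name counts at most once per list.
    appearances = Counter(name for top in top_lists for name in set(top))
    # Most appearances first; ties broken alphabetically (an assumption).
    return sorted(appearances.items(), key=lambda item: (-item[1], item[0]))


# Placeholder lists standing in for the separate popularity rankings.
top_lists = [
    ["expectation_a", "expectation_b", "expectation_c"],
    ["expectation_a", "expectation_b", "expectation_d"],
    ["expectation_a", "expectation_c", "expectation_e"],
]
ranking = rank_by_list_appearances(top_lists)
```

An Expectation that shows up on every list lands at the top of the ranking, regardless of where it placed within any single list.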