Metrics in GX: an introduction

What they are and why you should care

Erin Kapp
August 10, 2023
[Image: several metal gears, one stamped with "metrics." 📸: EtiAmmos via Adobe Stock]

This blog is about Metrics. That’s Metrics as in 'the specific GX Metric object,' not 'the generic concept of metrics.' 

Most GX users won’t ever need to interact with a Metric. You’ll really only encounter Metrics directly when you’re deep in an Expectation’s code or creating a Custom Expectation. 

So this post is a primer aimed at the subset of advanced GX users who are doing that under-the-hood Expectations work. And, of course, at anyone who’s interested just because.

If you were hoping to hear about generic metrics, can I interest you in this blog post instead?

What is a Metric?

Metrics are a key component of Expectations. One easy way to define a Metric is:

A Metric is an answer to a question you have about your data.

… where the question is part of your Expectation.

Minimalist Metrics

For a simple example of how a Metric relates to an Expectation, let’s consider expect_column_max_to_be_between. You use this Expectation to describe an acceptable range of values (provided by you) for the column’s maximum.

To determine if this Expectation is being met, GX needs to answer a question about the data: what is the column’s maximum value? 

With the answer to that single question, you can get the results of the Expectation. So this Expectation needs just one Metric: `column.max`.

Similarly, you can determine the results of expect_column_unique_value_count_to_be_between by answering a single question: how many unique values does the column have? That question corresponds to the sole Metric `column.distinct_values.count`.
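To make the question-and-answer framing concrete, here's a minimal sketch in plain Python (not GX internals; the column values are made up) of the single answer each of these Expectations needs:

```python
# A toy "column" of values (illustrative data, not a real GX Batch)
column = [3, 7, 7, 2, 9, 3]

# The one Metric behind expect_column_max_to_be_between:
# "what is the column's maximum value?"
column_max = max(column)  # 9

# The one Metric behind expect_column_unique_value_count_to_be_between:
# "how many unique values does the column have?"
distinct_count = len(set(column))  # 4 (values 2, 3, 7, 9)

# Each Expectation then just checks its single answer against the
# range you provided, e.g. for expect_column_max_to_be_between:
success = 1 <= column_max <= 10  # True
```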

Those examples are straightforward because those Expectations produce a single overall statistic or result for each Batch they evaluate. The Expectation is passed or failed based on that one answer.

But Expectations can also produce a pass/fail for each row, with the Expectation’s results based on the totality of the row results. Getting that kind of answer entails asking more than one question, which means more than one Metric.

With this kind of Expectation—which here we’ll call ColumnMap, after the class that’s used to implement them—we start to see multiple Metrics.

Metrics for (Column)Maps

In a ColumnMap Expectation, you’re evaluating individual rows. If all the rows pass, the Expectation passes.

So the main questions you’re asking about the data as a whole are:

  • How many rows are there?

  • How many rows don’t meet the validation criteria?

  • How many invalid values are there?

  • What are the invalid values?

Generally, these questions show up in a ColumnMap Expectation as the following Metrics:

  • table.row_count

  • column_values.nonnull.unexpected_count

  • column_values.<expectation_name>.unexpected_count

  • column_values.<expectation_name>.unexpected_values

The first two questions and their respective Metrics are straightforward: `table.row_count` reports the total number of rows. It's usually accompanied by `column_values.nonnull.unexpected_count`, which gives the number of rows that fail, though in some scenarios you'll see `column_values.null.unexpected_count` instead.

The values of these two Metrics are what you need to determine whether the Expectation passes.

Strictly speaking, you don't need to ask how many unexpected values there are (`column_values.<expectation_name>.unexpected_count`) or what they are (`column_values.<expectation_name>.unexpected_values`). But without this information, a failed ColumnMap Expectation can't provide you with any context about the failure; in practice, you should always ask these questions.
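As an illustration, here's a plain-Python sketch (not GX's implementation; the data and the pass-by-convention rule for the first row are made up) of how the row-count, unexpected-count, and unexpected-values answers arise for an increasing-values check. Null handling is omitted for brevity:

```python
# Toy column for a "values must be increasing" row-level check
# (illustrative data and logic, not GX internals)
column = [1, 2, 2, 5, 4]

# table.row_count: how many rows are there?
row_count = len(column)  # 5

# Row-level condition: each value must exceed the previous one
# (the first row has no predecessor, so it passes by convention here)
row_passes = [True] + [b > a for a, b in zip(column, column[1:])]

# column_values.<expectation_name>.unexpected_count: how many rows fail?
unexpected_count = row_passes.count(False)  # 2

# column_values.<expectation_name>.unexpected_values: which values failed?
unexpected_values = [v for v, ok in zip(column, row_passes) if not ok]  # [2, 4]
```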

For many ColumnMap Expectations in the Expectation Gallery, such as expect_column_values_to_be_increasing, these four Metrics are the ones you’ll see:

[Screenshot: the Metrics listed for expect_column_values_to_be_increasing in the Expectation Gallery]

Metrics & `mostly`
There's one more aspect to consider for ColumnMap Expectations: they can use the `mostly` parameter.

Using `mostly` allows you to set a threshold for the percentage of rows that have to pass in order for the Batch as a whole to pass. The default, without `mostly`, is 100%.

`mostly` lets you pass data even if it's less than perfect, while still specifying a point at which the data will no longer be 'good enough.' It's calculated using the same `table.row_count` and `column_values.nonnull.unexpected_count` Metrics that the default pass/fail behavior uses.
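As a rough sketch of the arithmetic (the Metric values below are made up for illustration, not from a real GX run):

```python
# Made-up Metric values for one Batch
row_count = 100        # table.row_count
unexpected_count = 7   # rows that failed the row-level check

# Fraction of rows that passed
success_ratio = (row_count - unexpected_count) / row_count  # 0.93

# The default behavior is equivalent to mostly=1.0: every row must pass
default_success = success_ratio >= 1.0  # False

# With mostly=0.9, 93% of rows passing clears the 90% threshold
mostly_success = success_ratio >= 0.9   # True
```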

Making Metrics

We've talked about Metrics as the answers to questions. It's natural to ask whether Metrics also calculate those answers.

In short: no. This is where the MetricProvider steps in.

As we start talking about calculation, recall that GX can use different Execution Engines. Pandas, Spark, and SQLAlchemy each need different code to carry out the same calculation... so each Metric actually needs multiple implementations.

MetricProvider handles the connection between the Metric and the appropriate Execution Engine. To quote the MetricProvider conceptual guide:

To allow Expectations to work with multiple backends, methods for calculating Metrics need to be implemented for each ExecutionEngine. For example, [calculating the mean in] pandas is implemented by calling the built-in pandas `.mean()` method on the column, Spark is implemented with a built-in Spark `mean` function…

…the inputs for MetricProvider classes are methods for calculating the Metric on different backend applications. Each method must be decorated with an appropriate decorator. On `new`, the MetricProvider class registers the decorated methods as part of the Metrics registry so that they can be invoked to calculate Metrics.

That concludes this intro to Metrics in GX! You can read more about implementing a Metric here, or check out the rest of our documentation.



©2023 Great Expectations. All Rights Reserved.