We've revamped Checkpoints!
We’ve made Checkpoints highly configurable first-class citizens in Great Expectations to meet all your validation needs.
January 28, 2021
You might recall seeing something like this when creating or running a Checkpoint in the Great Expectations CLI:
Well, we’re excited to announce that Checkpoints are now fully grown up and no longer experimental!
As of Great Expectations version 0.13.8, Checkpoints integrate the new logic and metadata from Batches, Validators and new Datasources released with Great Expectations 0.13, or what we call the “new” or “experimental” API. We extended Checkpoints to handle more use cases with far less boilerplate code, and while we were at it, we also tidied up the underlying classes and added a CheckpointStore, which makes them a first-class citizen just like Datasources and Expectations. This is why we call them “class-based” Checkpoints in the documentation, as opposed to “legacy” Checkpoints. This is what you’re going to see in our documentation for now to distinguish the two versions:
This blog post covers the improved Checkpoints, and introduces how to create, configure, and run them.
What can you do with Checkpoints?
A Checkpoint bundles an Expectation Suite with a batch, or multiple batches, of data, which allows you to easily run validation and kick off any follow-up actions. Some examples of where you can use pre-configured Checkpoints to validate data include:
- Configuring a Checkpoint against a multi-batch data asset in order to run data validation on all batches, including new ones
- Validating data in an Airflow task using the Great Expectations Airflow operator, or in other DAG runners by simply invoking a Python task
- Validating data pipeline code with GitHub Actions when new data is published
… and pretty much all other data validation needs.
Most importantly, please check out our Core Concepts documentation on Checkpoints, which is packed with awesome explanations and examples. Side note and invitation to contribute: as you can see from the linked docs, we’ve produced documentation for some of these use cases. If you’re in the mood to give back to the community, we’d really appreciate PRs documenting additional deployment patterns, either on this list or new ideas of your own.
First things first: This is a non-breaking change, so you can upgrade to 0.13.8 and continue to use your legacy Checkpoints as-is (including the “This feature is experimental” warning message). As usual, we’re releasing this version of Checkpoints as a v1, with some room to iterate, so there might be some smaller tweaks happening in future releases. These changes are based on a huge amount of community feedback; we’re confident that they are a big step in the right direction and that we’ll be able to smooth out any rough edges soon.
At the time of writing this blog post, there are two ways you can interact with Checkpoints, depending on whether you’re using the “stable” Great Expectations API for key concepts such as Datasources, or the “experimental” API:
- The Checkpoint-related CLI commands like checkpoint run continue to work with configurations for concepts using the “stable” Great Expectations API. If you’re using Datasources from the stable API, your Checkpoint workflows won’t change.
- If you’re using Datasources from the “experimental” Great Expectations API, you can access new-style Checkpoints through code, as we are planning to switch over the CLI entirely to all new concepts in a future release. There’s just one thing you’ll need to do if you want to use class-based Checkpoints in 0.13.8: We’ve incremented the version number of the great_expectations.yml config file, which means you’ll have to run the CLI upgrade tool.
- There’s a third option: You can continue to use legacy Checkpoints via the CLI, but have them backed by a CheckpointStore if you upgrade the config version. This would allow you to store your Checkpoints somewhere other than the filesystem, e.g. in cloud storage.
You will see some warning messages regarding the new configuration file version, but that’s ok. Only upgrade when you’re confident you want to.
Creating and configuring Checkpoints
Let’s assume you’ve already configured a data asset MyDataAsset and an Expectation Suite my_suite, and you just want to create a Checkpoint that validates the data asset against my_suite. This is the configuration that’ll bundle the Expectation Suite and the respective batch request using the SimpleCheckpoint class, which takes care of a few defaults:
config = """
name: my_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
validations:
  - batch_request:
      datasource_name: my_datasource
      data_connector_name: my_data_connector
      data_asset_name: MyDataAsset
      partition_request:
        index: 0
    expectation_suite_name: my_suite
"""
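Before handing a configuration like this to Great Expectations, it can help to sanity-check its shape. The sketch below writes the same structure as a plain Python dict, using the my_suite name from the text, and checks that every validation names a batch request and a suite. The helper missing_keys and the set of keys it treats as required are assumptions made for illustration; they are not part of Great Expectations.

```python
# Plain-Python mirror of the Checkpoint config (no Great Expectations import needed).
checkpoint_config = {
    "name": "my_checkpoint",
    "config_version": 1.0,
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": {
                "datasource_name": "my_datasource",
                "data_connector_name": "my_data_connector",
                "data_asset_name": "MyDataAsset",
                "partition_request": {"index": 0},
            },
            "expectation_suite_name": "my_suite",
        }
    ],
}

# Assumption for this sketch: these keys identify a batch of data.
REQUIRED_BATCH_KEYS = ("datasource_name", "data_connector_name", "data_asset_name")


def missing_keys(config):
    """Hypothetical helper: list the keys this sketch expects but cannot find."""
    problems = []
    for i, validation in enumerate(config.get("validations", [])):
        batch_request = validation.get("batch_request", {})
        for key in REQUIRED_BATCH_KEYS:
            if key not in batch_request:
                problems.append(f"validations[{i}].batch_request.{key}")
        if "expectation_suite_name" not in validation:
            problems.append(f"validations[{i}].expectation_suite_name")
    return problems


print(missing_keys(checkpoint_config))  # prints [] for the config above
```

A check like this only catches structural slips such as a misspelled key; whether the named datasource, connector, and suite actually exist is something only your Data Context can tell you.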
Well, this looks pretty much like the old-school Checkpoint yml files we’ve seen previously - a batch and a suite, nothing special. They simply replace what was previously known as “ValidationOperators”. However, the real power of these new Checkpoints comes from their configurability. For example, we can add multiple batch_requests to a Checkpoint to validate several assets with the same Expectation Suite, we can nest ValidationActions, add Evaluation Parameters, set the output type for validation results, and use templates for run_name. See this epic example of a highly customized Checkpoint configuration using the Checkpoint base class instead of SimpleCheckpoint:
config = """
name: my_fancy_checkpoint
config_version: 1
class_name: Checkpoint
run_name_template: "%Y-%M-foo-bar-template-$VAR"
validations:
  - batch_request:
      datasource_name: my_datasource
      data_connector_name: my_special_data_connector
      data_asset_name: users
      partition_request:
        index: -1
  - batch_request:
      datasource_name: my_datasource
      data_connector_name: my_other_data_connector
      data_asset_name: users
      partition_request:
        index: -2
expectation_suite_name: users.delivery
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
evaluation_parameters:
  param1: "$MY_PARAM"
  param2: 1 + "$OLD_PARAM"
runtime_configuration:
  result_format:
    result_format: BASIC
    partial_unexpected_count: 20
"""
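Reading the example above with expectation_suite_name, action_list, evaluation_parameters and runtime_configuration at the top level, one way to picture the configuration is that top-level values act as defaults that every entry in validations inherits unless it sets its own. The sketch below illustrates that reading in plain Python. The helper resolve_validations and the list of inheritable keys are hypothetical, and the override behaviour is an assumption rather than the Great Expectations implementation.

```python
import copy

# Keys assumed (for this sketch only) to be inheritable from the top level.
INHERITABLE_KEYS = (
    "expectation_suite_name",
    "action_list",
    "evaluation_parameters",
    "runtime_configuration",
)


def resolve_validations(config):
    """Hypothetical helper: fill each validation with top-level defaults."""
    defaults = {key: config[key] for key in INHERITABLE_KEYS if key in config}
    resolved = []
    for validation in config.get("validations", []):
        merged = copy.deepcopy(defaults)
        merged.update(copy.deepcopy(validation))  # per-validation values win
        resolved.append(merged)
    return resolved


# Trimmed-down version of my_fancy_checkpoint: two batches, one shared suite.
fancy = {
    "name": "my_fancy_checkpoint",
    "expectation_suite_name": "users.delivery",
    "validations": [
        {"batch_request": {"data_connector_name": "my_special_data_connector",
                           "partition_request": {"index": -1}}},
        {"batch_request": {"data_connector_name": "my_other_data_connector",
                           "partition_request": {"index": -2}}},
    ],
}

for validation in resolve_validations(fancy):
    connector = validation["batch_request"]["data_connector_name"]
    print(connector, "->", validation["expectation_suite_name"])
```

Under this reading, both batches are validated against users.delivery, which is what lets one Checkpoint cover several assets without repeating the suite name.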
For more examples of the various configuration options for these new Checkpoints, take a look at our documentation!
Wait, what are ValidationActions again?
ValidationActions and ValidationOperators continue to exist inside of Checkpoints. However, we think of them as purely internal concerns. You will configure them within Checkpoints, but you would almost never instantiate or invoke them outside of Checkpoints.
This matters for extensibility. ValidationActions are pluggable actions that can kick off secondary processes after data validation, such as:
- Storing validation results
- Building Data Docs
- Triggering notifications, such as Slack notifications
Because Checkpoints wrap ValidationActions, you can configure them just like you used to be able to do. The SimpleCheckpoint class actually defaults to the above action list! If we want more fine-grained control over which actions to run after validation, we can add a custom action_list like in the above example:
config = """
...
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
...
"""
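To make the idea of pluggable actions concrete, here is a minimal plain-Python sketch of the pattern: each entry in an action list has a name and a class name, and every action runs in order once validation has finished. PrintResultAction, CollectFailuresAction, ACTION_REGISTRY and run_actions are all hypothetical stand-ins. Real ValidationActions in Great Expectations have their own base class and signatures, which this sketch does not try to reproduce.

```python
class PrintResultAction:
    """Hypothetical action: report whether validation succeeded."""

    def run(self, validation_result):
        return f"success={validation_result['success']}"


class CollectFailuresAction:
    """Hypothetical action: list the expectations that failed."""

    def run(self, validation_result):
        return [
            item["expectation"]
            for item in validation_result["results"]
            if not item["success"]
        ]


# Maps class_name strings from the config to classes (a stand-in for plugin lookup).
ACTION_REGISTRY = {
    "PrintResultAction": PrintResultAction,
    "CollectFailuresAction": CollectFailuresAction,
}


def run_actions(action_list, validation_result):
    """Instantiate and run each configured action in order, keyed by its name."""
    outputs = {}
    for entry in action_list:
        action_cls = ACTION_REGISTRY[entry["action"]["class_name"]]
        outputs[entry["name"]] = action_cls().run(validation_result)
    return outputs


# Same shape as the action_list in the YAML above, with the hypothetical classes.
action_list = [
    {"name": "print_result", "action": {"class_name": "PrintResultAction"}},
    {"name": "collect_failures", "action": {"class_name": "CollectFailuresAction"}},
]
fake_result = {
    "success": False,
    "results": [
        {"expectation": "expect_column_to_exist", "success": True},
        {"expectation": "expect_column_values_to_not_be_null", "success": False},
    ],
}
print(run_actions(action_list, fake_result))
```

The point of the pattern is that adding a new follow-up step means adding one class and one list entry, without touching the validation code itself.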
We expect the list of integrations in ValidationActions, such as the types of notifications to send, to continue to grow. If you have ideas, we’d love to help you contribute them back to the community. Please check out our contribution guide to get started!
Running Checkpoints is easy. We’ve designed them with two principles in mind:
- Minimal in-line code.
- Make them set-and-forget. Once you’ve configured a Checkpoint, you should be able to just run it repeatedly to validate your data assets, without requiring additional configuration.
You can simply run a new-style Checkpoint in code using the following snippet:
import great_expectations as ge

ge.get_context().run_checkpoint(
    checkpoint_name="my_checkpoint",
)
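In a pipeline or scheduled job you typically want a failed validation to stop the run. The sketch below wraps the call in a small function. It assumes the object returned by run_checkpoint exposes a success flag via item access; validate_or_raise, ValidationFailed and FakeContext are hypothetical names. FakeContext stands in for a real Data Context so the example runs without a Great Expectations project.

```python
class ValidationFailed(Exception):
    """Raised by this sketch when a Checkpoint run reports failure."""


def validate_or_raise(context, checkpoint_name):
    """Run the named Checkpoint and raise if the result is not successful.

    `context` is anything with a run_checkpoint(checkpoint_name=...) method,
    for example the object returned by ge.get_context().
    """
    result = context.run_checkpoint(checkpoint_name=checkpoint_name)
    # Assumption: the result supports result["success"].
    if not result["success"]:
        raise ValidationFailed(f"Checkpoint {checkpoint_name!r} failed")
    return result


class FakeContext:
    """Hypothetical stand-in for a Data Context, used only for this demo."""

    def __init__(self, outcomes):
        self.outcomes = outcomes

    def run_checkpoint(self, checkpoint_name):
        return {"success": self.outcomes[checkpoint_name]}


context = FakeContext({"my_checkpoint": True, "broken_checkpoint": False})
print(validate_or_raise(context, "my_checkpoint"))  # prints {'success': True}
```

Raising an exception, rather than returning a boolean, means a task runner will mark the step as failed without any extra glue code.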
As we mentioned above, we will soon integrate the new Checkpoints with the CLI, which means you will be able to trigger a run from the CLI too. Currently, the CLI checkpoint run command still supports the legacy Checkpoints, which only operate on concepts from the stable Great Expectations API.
We hope this article gives you an idea of what to expect when using Checkpoints for validation. For more detailed information about new Checkpoints, please refer to the updated how-to guides in the “Validation” section of our docs, as well as the Core Concepts pages in our documentation.