Limit validation results in Data Docs to customize your GX workflows

If Data Doc accumulation is impacting your Great Expectations performance, try one of these strategies to alleviate it.

Austin Robinson
October 06, 2022
A photo of a small black and yellow turtle on a white desk
An environment overloaded with Data Docs tends to slow down. (📸: Foad Memariaan via Unsplash)

As you use Great Expectations, the number of Validation Results maintained, indexed, and rendered for your Data Docs grows. When your Data Doc accumulation reaches a critical mass, you might start to see a degradation in performance while updating and rendering them. 

Generally, the main driver of Data Doc accumulation is the number of Validations you have run. However, data volume can also contribute: all else being equal, Data Doc rendering slows down sooner as the volume of data being Validated increases.

If Data Doc accumulation impacts your GX performance, the impact applies to every run of every Checkpoint. This happens because GX’s default behavior is to re-render all Data Docs within a deployment whenever a Checkpoint is run, regardless of the Data Docs’ association with the run in progress. 

To alleviate this issue without significantly reconfiguring your GX environment, you have several options.

Use UpdateDataDocs

If both of these statements are true:

  • The cause of your performance degradation is a large number of Validations in your overall environment.

  • The Validations being run in your Checkpoint are not orders of magnitude larger than other Validations in the environment in terms of Expectation volumes or data volumes.

…you can make use of the UpdateDataDocs Checkpoint Action.

This replaces the BuildDataDocs Action and renders only the new Validations from the Checkpoint in which the Action runs. This makes it more performant than BuildDataDocs.

However, because UpdateDataDocs does not act outside of the active Checkpoint, other Data Docs may not reflect any changes made to GX or to the local environment.

If your Data Docs are being served live, your requirements may make UpdateDataDocs unsuitable.
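As a rough illustration of where this Action sits, the sketch below writes a Checkpoint's action list as plain Python data. It assumes the Action's class is named UpdateDataDocsAction and accepts an optional site_names list, as in 0.15-era Checkpoint configurations. The entry names, the local_site site name, and the docs_actions helper are placeholders for illustration, so check them against your own configuration.

```python
# Sketch of a Checkpoint action list, written as plain Python data.
# Assumption: class names follow the 0.15-era convention ("...Action" suffix).
action_list = [
    {
        "name": "store_validation_result",
        "action": {"class_name": "StoreValidationResultAction"},
    },
    {
        "name": "update_data_docs",
        "action": {
            "class_name": "UpdateDataDocsAction",
            # Placeholder site name; restricts the update to the listed sites.
            "site_names": ["local_site"],
        },
    },
]


def docs_actions(actions):
    """Return the class names of the Data Docs-related actions in the list."""
    return [
        entry["action"]["class_name"]
        for entry in actions
        if "DataDocs" in entry["action"]["class_name"]
    ]


print(docs_actions(action_list))
```

A quick check like docs_actions can help confirm which Data Docs Action a Checkpoint is actually configured to run before you compare performance.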

Use validation_results_limit

In scenarios where UpdateDataDocs is not appropriate, you can improve performance by using GX’s validation_results_limit configuration option to specify the number of historical Data Docs that GX retains.

The validation_results_limit option is passed under the site_index_builder of a given Data Docs site configuration. These site configurations are located under the data_docs_sites key of your great_expectations.yml file.

For example, limiting your Validation Results on a local Data Docs site to only the five most recent could look something like:

data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      validation_results_limit: 5

Using this technique, Validation Results from previous Checkpoints are only rendered and indexed up to the limit that you specify.
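If you maintain more than one Data Docs site, applying the same limit to each by hand can get tedious. The following is a minimal sketch of one way to do it, assuming your great_expectations.yml has already been loaded into a Python dict by a YAML parser of your choice. The helper name set_validation_results_limit is hypothetical and not part of GX.

```python
import copy


def set_validation_results_limit(config, limit):
    """Return a copy of a GX config dict with the limit set on every Data Docs site."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    for site in updated.get("data_docs_sites", {}).values():
        # Create the site_index_builder block if the site does not define one.
        builder = site.setdefault(
            "site_index_builder", {"class_name": "DefaultSiteIndexBuilder"}
        )
        builder["validation_results_limit"] = limit
    return updated


# Example input mirroring the shape of the YAML above.
config = {
    "data_docs_sites": {
        "local_site": {
            "class_name": "SiteBuilder",
            "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
        }
    }
}
limited = set_validation_results_limit(config, 5)
print(limited["data_docs_sites"]["local_site"]["site_index_builder"])
```

The helper returns a modified copy rather than editing in place, so you can review the result before writing it back to your configuration file.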

If your performance degradation is due to your historical accumulation of Data Docs, using validation_results_limit is an easy way to alleviate the issue without sacrificing environment-wide Data Doc building.

It’s important to note that validation_results_limit doesn’t limit the total number of HTML documents contained in your Data Docs site. If HTML documents other than Validation Results are contributing to your performance degradation, their effects will not be reduced by validation_results_limit.

Use docs clean

If you determine that rebuilding your Data Docs site from scratch is the best way for you to control your HTML document volume, you have another option.

Periodically running great_expectations docs clean from the command line will delete all existing HTML documentation from your site.

Generally, you would only want to do this routinely if at least one of the following applies to you:

  • Data Docs other than Validation Results are a significant contributor to your document volume.

  • You do not want to automatically limit your Validation Results storage.
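If you would rather trigger the same clean-and-rebuild cycle from a scheduled Python job than from the command line, a sketch along these lines may help. It assumes your Data Context object exposes clean_data_docs(site_name=...) and build_data_docs(site_names=...), as in 0.15-era releases. The wrapper name rebuild_data_docs is hypothetical.

```python
def rebuild_data_docs(context, site_name=None):
    """Delete the rendered HTML for one site (or all sites), then rebuild it."""
    # Assumption: the context provides clean_data_docs() and build_data_docs().
    context.clean_data_docs(site_name=site_name)
    # Passing None is assumed to mean "every configured site".
    site_names = [site_name] if site_name else None
    return context.build_data_docs(site_names=site_names)
```

In a real project you would pass in the same Data Context that you already use to run your Checkpoints.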

Summary

Great Expectations offers you several options for maintaining your GX deployment and optimizing the speed of your GX workflows:

  • UpdateDataDocs, for controlling the volume of Data Docs generated by a Checkpoint.

  • validation_results_limit, for controlling the volume of historical Data Docs you retain.

  • great_expectations docs clean, for deleting all HTML files from your site.

Together, these approaches can help you maintain the performance of your GX deployment over time.


Great Expectations is part of an increasingly flexible and powerful modern data ecosystem. This is just one example of the ways in which Great Expectations is able to give you greater control of your data quality processes within that ecosystem.

We’re committed to supporting and growing the community around Great Expectations. It’s not enough to build a great platform—we want to build a great community as well. Join our public Slack channel here, find us on GitHub, sign up for one of our weekly cloud workshops, or head to https://greatexpectations.io/ to learn more.
