As you use Great Expectations, the number of Validation Results maintained, indexed, and rendered for your Data Docs grows. When your Data Docs accumulate to a critical mass, you might start to see degraded performance when updating and rendering them.
Generally, the factor driving Data Doc accumulation is that you have run large numbers of Validations. However, the size of your data volume can also contribute: all else being equal, Data Doc rendering slows down sooner as the data volume being Validated increases.
If Data Doc accumulation impacts your GX performance, the impact applies to every run of every Checkpoint. This happens because GX's default behavior is to re-render all Data Docs within a deployment whenever a Checkpoint is run, regardless of whether those Data Docs are associated with the run in progress.
To alleviate this issue without significantly reconfiguring your GX environment, you have several options.
Use UpdateDataDocs
If both of these statements are true:
The cause of your performance degradation is a large number of Validations in your overall environment.
The Validations being run in your Checkpoint are not orders of magnitude larger than other Validations in the environment in terms of Expectation volumes or data volumes.
…then you can make use of the UpdateDataDocs Checkpoint Action. This Action renders only the new Validation Results for the Checkpoint in which it runs, rather than re-rendering every Data Docs site in the deployment, which makes it more performant than a full Data Docs rebuild.
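For concreteness, here is a minimal sketch of how that Action might appear in a YAML-configured Checkpoint. This assumes a 0.x-style GX deployment, and the Checkpoint, Datasource, data asset, and Expectation Suite names below are placeholders rather than anything from your project:

name: my_checkpoint                     # placeholder Checkpoint name
config_version: 1.0
class_name: Checkpoint
validations:
  - batch_request:
      datasource_name: my_datasource    # placeholder
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: my_table         # placeholder
    expectation_suite_name: my_suite    # placeholder
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction  # renders Data Docs only for this Checkpoint's new results
      # Optionally restrict rendering to specific Data Docs sites:
      # site_names: [local_site]

With an action_list like this, running the Checkpoint stores its Validation Results and updates Data Docs for those results alone, instead of triggering a full re-render of every site in the deployment.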
However, because UpdateDataDocs does not act outside of the active Checkpoint, other Data Docs may not reflect changes made to GX or to the local environment. If your Data Docs are served live, this limitation may make UpdateDataDocs unsuitable for your requirements.
Use validation_results_limit
In scenarios where UpdateDataDocs is not appropriate, you can improve performance by using GX's validation_results_limit configuration option to specify the number of historical Validation Results that GX retains in your Data Docs.
The validation_results_limit option is passed under the site_index_builder key of a given Data Docs site configuration. These site configurations live under the data_docs_sites key of your great_expectations.yml file.
For example, limiting your Validation Results on a local Data Docs site to only the five most recent could look something like:
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      validation_results_limit: 5
Using this technique, Validation Results from previous Checkpoint runs are rendered and indexed only up to the limit you specify.
If your slowdown is due to the historical accumulation of Data Docs, using validation_results_limit is an easy way to alleviate the issue without sacrificing environment-wide Data Doc building.
It's important to note that validation_results_limit doesn't limit the total number of HTML documents contained in your Data Docs site. If HTML documents other than Validation Results are contributing to your performance degradation, their effects will not be reduced by validation_results_limit.
Use docs clean
If you determine that rebuilding your Data Docs site from scratch is the best way for you to control your HTML document volume, you have another option.
Periodically running great_expectations docs clean from the command line will delete all existing HTML documentation from your site.
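As a rough sketch (the exact CLI flags can vary between GX versions, and local_site here is simply the default site name used in the configuration example above), cleaning and then rebuilding a single site might look like:

# Delete all rendered HTML for the local_site Data Docs site
great_expectations docs clean --site-name local_site

# Rebuild the site from the Validation Results you still retain
great_expectations docs build --site-name local_site

Cleaning on its own leaves the site empty until the next build, so pairing the two commands keeps your Data Docs available while shedding the accumulated HTML.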
Generally, you would only want to do this routinely if at least one of the following applies to you:
Data Docs other than Validation Results are a significant contributor to your document volume.
You do not want to automatically limit your Validation Results storage.
Summary
Great Expectations offers you several options for maintaining your GX deployment and optimizing the speed of your GX workflows:
UpdateDataDocs, for controlling the volume of Data Docs generated by a Checkpoint.
validation_results_limit, for controlling the volume of historical Data Docs you retain.
great_expectations docs clean, for deleting all HTML files from your site.
Together, these approaches can help you maintain the performance of your GX deployment over time.
Great Expectations is part of an increasingly flexible and powerful modern data ecosystem. This is just one example of the ways in which Great Expectations is able to give you greater control of your data quality processes within that ecosystem.
We’re committed to supporting and growing the community around Great Expectations. It’s not enough to build a great platform—we want to build a great community as well. Join our public Slack channel here, find us on GitHub, sign up for one of our weekly cloud workshops, or head to https://greatexpectations.io/ to learn more.