When Great Expectations tests my data, does it move it? Copy it? Change it?
We get these questions a lot, and the short answer is: no! Not at all.
In fact, GX actively avoids moving, downloading, copying, or altering the data it tests.
In this post, we’ll answer some of the most common questions we get about this.
GX does not download or move your data
We understand why people wonder about this.
But GX only ever uses resources you already control. Whenever possible, it works on the data in place, in the location where it’s already stored.
Here are some specifics of how that manifests for different Execution Engines.
Spark Datasource behavior
GX works with Spark by using Spark-native functions and building Spark queries, which it executes against the data in your Spark instance.
Throughout this, your data stays wherever you decided to put it. For example, this might be your S3 bucket, your Databricks instance, or the machine local to your GX installation.
If your data’s location can’t support compute, GX will temporarily persist it somewhere that can. The compute location is always somewhere you already control, and the data is not removed from its bucket during compute.
With Spark, GX pushes compute back to your Spark instance whenever possible. If it’s not possible, GX will perform the compute on the machine local to your GX installation.
Summary: with Spark, all the data and compute needed for a GX test stay within your data’s original DB, your Spark instance, and (only if needed) the machine local to your GX installation.
Pandas Datasource behavior
GX works with Pandas by leveraging the Pandas tools that operate on your data locally.
Throughout this, your data stays wherever you decided to put it: for example, your S3 bucket.
For Pandas, the compute location is always the machine local to your GX installation. This is the standard behavior of a Pandas DataFrame.
Summary: with Pandas, all the data and compute involved in GX testing stay within your data’s original DB and the machine local to your GX installation.
SQL Datasource behavior
GX works with SQL by using SQLAlchemy to build and execute queries against your data in your DB.
Compute is pushed back to your SQL DB whenever possible. If necessary, GX will temporarily persist data to the machine local to your GX installation as part of the compute.
Those are the only two places where SQL Datasource data and compute will be. GX does not move the data from your SQL DB to a Pandas DataFrame. We aren’t sure where this idea is coming from, but it’s somewhat common, and it’s wrong.
Summary: with SQL, all the data and compute involved in GX testing stay within your SQL DB and (only if needed) the machine local to your GX installation.
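To make "compute is pushed back to your DB" a little more concrete, here is a minimal sketch of the general idea. It is not GX's actual implementation: it uses Python's built-in `sqlite3` module rather than SQLAlchemy, and the `orders` table, the `amount` column, and the `null_count_metric` helper are all hypothetical names invented for illustration. The point is only that the database does the counting, and just two numbers come back.

```python
import sqlite3

# Hypothetical in-memory database standing in for "your SQL DB".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 25.5), (4, 7.25)],
)


def null_count_metric(connection, table, column):
    """Ask the database to count rows and nulls; only two integers come back."""
    # Identifiers are interpolated for brevity; this sketch assumes trusted names.
    query = (
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    )
    total_rows, null_rows = connection.execute(query).fetchone()
    return {"row_count": total_rows, "null_count": null_rows}


metrics = null_count_metric(conn, "orders", "amount")
print(metrics)
```

The rows themselves never leave the database in this sketch; only the aggregate values do.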
Typically, an Expectation does some sort of last-mile operation on the metadata that was generated by the queries and operations against your data. The results of this operation are used to construct your Validation Results and Data Docs.
These last-mile operations happen on the machine local to your GX installation by necessity.
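As a rough illustration of what a last-mile operation can look like, the sketch below turns already-computed aggregate metrics into a pass/fail result. The `evaluate_not_null` function is hypothetical and the result dictionary is a simplified stand-in, not the real Validation Result schema. The `mostly` threshold is modeled on the parameter of the same name that many Expectations accept.

```python
def evaluate_not_null(metrics, mostly=1.0):
    """Decide success from aggregate metrics only; no row-level data is needed."""
    row_count = metrics["row_count"]
    null_count = metrics["null_count"]
    # Treat an empty batch as fully non-null to avoid dividing by zero.
    if row_count == 0:
        non_null_fraction = 1.0
    else:
        non_null_fraction = (row_count - null_count) / row_count
    return {
        "success": non_null_fraction >= mostly,
        "result": {
            "element_count": row_count,
            "unexpected_count": null_count,
            "unexpected_percent": 100.0 * (1.0 - non_null_fraction),
        },
    }


# Metrics like these would have been produced by queries against your data.
result = evaluate_not_null({"row_count": 4, "null_count": 1}, mostly=0.7)
strict_result = evaluate_not_null({"row_count": 4, "null_count": 1})
print(result["success"], strict_result["success"])
```

Everything this function touches is metadata about the data, which is why this step can run on the machine local to your GX installation without the data itself being there.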
Calculations a specific backend doesn’t support
The core Expectations can all be carried out as described above. Some of the experimental Expectations rely on complex calculations that may not be natively supported by a given backend.
In this circumstance, the data being operated on might be temporarily brought to the machine local to your GX installation. GX then completes the calculation there, using Python-native objects, numpy arrays, or Pandas DataFrames, depending on the preference of the community member who contributed the Expectation. When the calculation is complete, the temporary data is deleted.
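Here is a small sketch of that fallback pattern, under stated assumptions: we pretend the backend has no built-in median function, so the column values are pulled into a plain Python list, the median is computed locally, and the local copy is discarded. The `readings` table and the `local_median` helper are hypothetical, and real experimental Expectations may handle this quite differently.

```python
import sqlite3
import statistics

# Hypothetical in-memory database standing in for a backend without a median function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?)", [(3.0,), (1.0,), (9.0,), (4.0,)]
)


def local_median(connection, table, column):
    """Fetch one column temporarily, compute locally, then drop the local copy."""
    # Identifiers are interpolated for brevity; this sketch assumes trusted names.
    rows = connection.execute(
        f"SELECT {column} FROM {table} WHERE {column} IS NOT NULL"
    ).fetchall()
    values = [row[0] for row in rows]
    try:
        return statistics.median(values)
    finally:
        # Discard the temporary local copies once the calculation is done.
        values.clear()
        del rows


median_value = local_median(conn, "readings", "value")
print(median_value)
```

The source table is only read, never written, and nothing from it outlives the function call except the single computed number.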
GX Cloud is a managed SaaS product, but its behavior is almost identical to that of GX OSS (described above).
Cloud doesn’t host your data, and doesn’t copy your data or Datasources to a GX-controlled machine or environment. Your primary data (that is, the data you are testing) is never passed through or persisted to any GX system.
Similarly, compute does not happen on the GX-managed environment; it takes place in Spark/your DB when possible and the machine local to your GX installation otherwise.
The only difference between Cloud and OSS is that, because Cloud displays your Validation Results and Data Docs in its user interface, Cloud does maintain and store the metadata that generates those things.
This metadata is generated from your data in the course of creating your Expectation Suites and validating your data. It’s entirely configurable: you can specify exactly what metadata Cloud is or isn’t allowed to retain.
Summary: GX Cloud hosts only Expectation Suites, Data Docs, and other aspects of GX that are generated from metadata. You can specify which metadata is or is not allowed in the GX-hosted environment. Your actual data and compute for GX Cloud stay where described above for the relevant Execution Engine.
Sometimes people have concerns specifically about the compute stage. Here’s a summary of the compute-related information:
Compute always happens somewhere you control. Whenever possible, this location is your Spark instance or DB. In all other cases, it’s the machine you’ve installed GX on.
If you use Pandas, the data will be pulled to a DataFrame from your bucket, which is the normal behavior of a Pandas DataFrame. GX has nothing to do with this.
GX does not change your data
GX never modifies your data in situ.
If you’re using a SQL DB, GX will create temporary tables in it to enable certain Expectations and improve performance. Typically, these tables are released after Validation or within 24 hours of their creation. They aren’t required, so you can disable them.
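For readers who want to see why a temporary table doesn't count as changing your data, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`events`, `tmp_batch`) are hypothetical, and the way GX actually names, creates, and cleans up temporary tables depends on your backend. The sketch only shows the general pattern: the temporary table is a separate object, and the source table is untouched before and after.

```python
import sqlite3

# Hypothetical in-memory database standing in for "your SQL DB".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)", [(1, "ok"), (2, "error"), (3, "ok")]
)

# Materialize a query result as a temporary table to run checks against.
conn.execute(
    "CREATE TEMP TABLE tmp_batch AS SELECT * FROM events WHERE status = 'ok'"
)
batch_rows = conn.execute("SELECT COUNT(*) FROM tmp_batch").fetchone()[0]

# Release the temporary table once the checks are done.
conn.execute("DROP TABLE tmp_batch")

# The source table still has every row, and no temporary tables remain.
source_rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
temp_tables_left = conn.execute(
    "SELECT COUNT(*) FROM sqlite_temp_master WHERE type = 'table'"
).fetchone()[0]
print(batch_rows, source_rows, temp_tables_left)
```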
This blog post has more information about why we don’t do data transformations as part of Expectations. Since its publication in 2021, we’ve actually taken this idea even further: we no longer allow runtime querying of your data as described in that post. You now have to define your assets with a query if needed.
Great Expectations doesn’t download or move your data, host your compute outside of your systems, or modify your data.
In fact, GX is purposefully engineered so it can test your data without removing it from your systems. Everything possible happens in the location where your data is actually being stored. Everything else takes place on the machine local to your GX installation.
No matter which deployment of GX you use, you can be confident that your data never leaves your control. With GX OSS, no GX-hosted or -controlled system ever stores, copies, or computes on your data. With Cloud, only metadata (which you can configure) is stored on the GX-hosted platform.
If you had any questions about how GX interacts with your data, we hope this answers them!
You can reach the GX team via our community Slack.