
The Execution Engine explorer

A look at this mostly-under-the-hood part of GX

Erin Kapp
September 28, 2023

What’s an Execution Engine?

In Great Expectations, an Execution Engine is the system that you use to process the data from a particular Data Source.

GX supports three Execution Engines: Spark, Pandas, and SQLAlchemy.

How do I choose an Execution Engine?

You don’t really ‘choose’ an Execution Engine with our Fluent Data Sources. Your Execution Engine is tied to your data’s location:

  • For data in a SQL database, SQLAlchemy will be your engine.

  • For data in Spark, you’ll use the Spark engine.

  • For data in-memory, Pandas is the right engine.

  • For data in files (like Parquet or CSV), you choose between Spark and Pandas, depending on your preferred workflow.

When you create a Fluent Data Source (FDS), the appropriate Execution Engine is attached to it automatically.
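The mapping above can be written out as a small lookup. This is only an illustrative sketch, not GX code: the `engine_for` function and the location labels (`"sql"`, `"spark"`, `"in_memory"`, `"files"`) are hypothetical names made up for this example.

```python
from typing import Optional

# Hypothetical location labels mapped to the engine names from the list above.
# None of these names come from the GX API; they only model the decision rule.
ENGINE_BY_LOCATION = {
    "sql": ["SQLAlchemy"],
    "spark": ["Spark"],
    "in_memory": ["Pandas"],
    "files": ["Spark", "Pandas"],  # the only case with a real choice
}


def engine_for(location: str, preferred: Optional[str] = None) -> str:
    """Return the engine implied by where the data lives."""
    options = ENGINE_BY_LOCATION[location]
    if len(options) == 1:
        return options[0]  # the location decides; any preference is ignored
    if preferred is None:
        raise ValueError(f"{location!r} needs a choice between {options}")
    if preferred not in options:
        raise ValueError(f"{preferred!r} is not an option for {location!r}")
    return preferred


print(engine_for("sql"))
print(engine_for("files", "Pandas"))
```

Only the `"files"` entry has more than one option, which is why it is the only case that needs a `preferred` argument.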

Only one of those options had an actual choice.

Correct! The Execution Engine is about where your data lives; if the place your data lives is decided, so is your Execution Engine. 

It’s entirely possible to set up a production-grade GX deployment without ever consciously thinking about Execution Engines.

So why do I need to know about the Execution Engine?

Often, you might not. But there are a few scenarios where it’s good to at least be aware of the Execution Engine:

  • If you’re working with files instead of a database.

  • If you haven’t chosen a backend for your database yet. 

  • If you want to use query Expectations.

Choosing a backend database

The pros and cons of specific backends extend well beyond GX. But if maximizing the in-place compute for your GX testing is a top priority for you, here are a few things to keep in mind.

GX always does as much computing as possible in place where the data lives. This compute is carried out by the Execution Engine.

If you want to maximize the amount of in-place compute, choose Spark or SQL as your backend and make sure there’s plenty of compute capacity available there, so the Execution Engine can do its work where the data lives.

If you use Pandas, a dataframe is held in local memory, so all of the compute happens on the machine where you’ve installed GX. That machine is still 100% under your control, but it isn’t the same as the place where the data resides.
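To make the difference concrete, here is a rough sketch that uses Python's built-in `sqlite3` module as a stand-in for "the place where the data lives". It is not how GX implements its Execution Engines, and the `orders` table and its columns are invented for the example. It only contrasts letting the backend compute a result with pulling every row to the local machine first.

```python
import sqlite3

# An in-memory SQLite database stands in for "the place where the data lives".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 7.5), (4, None)],
)

# In-place style: the database counts the nulls and returns a single number.
(null_count_in_place,) = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount IS NULL"
).fetchone()

# Local style: every row is pulled to this machine, then counted here.
rows = conn.execute("SELECT id, amount FROM orders").fetchall()
null_count_local = sum(1 for _, amount in rows if amount is None)

print(null_count_in_place, null_count_local, len(rows))
conn.close()
```

Both approaches give the same answer. The difference is how much data crosses the wire and which machine does the counting, and that matters more as tables grow.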

Query Expectations

Query Expectations (Expectations that use the QueryExpectation class) are a special subset of Expectations. They allow you to put a simple wrapper Expectation around your arbitrarily complex Spark or SQL logic.

We see this mostly from people who already have complex Spark or SQL analytics that they don’t want to recreate, but who do want GX benefits like Data Docs.

That said, if you’re one of the people who can benefit from using query Expectations, you likely already know it.

Because query Expectations are tightly tied to a specific Spark or SQL dialect, we recommend them only for the scenario where you have extensive pre-existing work.
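The "simple wrapper around existing SQL" idea can be sketched outside of GX. The snippet below is a conceptual illustration only. `query_check` is a hypothetical helper, not the QueryExpectation API, and the `payments` table is invented. It uses SQLite through Python's `sqlite3` module, so the query is written in that dialect, which also shows why this kind of check is tied to one backend.

```python
import sqlite3
from typing import Callable


def query_check(conn: sqlite3.Connection, query: str,
                passes: Callable[[list], bool]) -> dict:
    """Run existing SQL as-is and wrap the result in a pass/fail record."""
    rows = conn.execute(query).fetchall()
    return {"success": passes(rows), "observed": rows}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("a", 20.0), ("a", 5.0), ("b", 12.0)],
)

# Pre-existing analytics logic, reused unchanged:
# customers whose total payments are under 10.
existing_sql = """
    SELECT customer, SUM(amount) AS total
    FROM payments
    GROUP BY customer
    HAVING SUM(amount) < 10
"""

# The check passes when the query returns no offending rows.
result = query_check(conn, existing_sql, passes=lambda rows: len(rows) == 0)
print(result)
conn.close()
```

The SQL stays exactly as it was written for the analytics work. Only the thin pass/fail layer around it is new, which is the appeal when you already have extensive query logic.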

Can I use more than one Execution Engine in my GX deployment?

Yes! Because Execution Engines are a per-Data-Source thing, you can bring together data from all your different backends into your single GX deployment. The Execution Engines will handle each Data Source individually, so all your data will play nicely together.

Summary

The pros and cons of where you store your data go way beyond any one factor. GX can work with your data wherever it is, and it’s the Execution Engines that make that possible.

Whether you’re using Pandas, Spark, or SQL, the appropriate Execution Engine is automatically attached when you create your Data Source, and GX handles everything else under the hood.

The Execution Engine isn’t something that most GX users will have to think about regularly, if ever. But there are a few situations, like those we covered in this blog, where some up-front knowledge of the Execution Engine’s role in GX can come in handy. 


©2024 Great Expectations. All Rights Reserved.