Great Expectations now supports execution in Spark!
A blog post with much clapping...
May 24, 2019
Trigger warning: this post contains clapping.
We are super excited to announce the release of one of Great Expectations’ most requested and most anticipated features: Spark execution.
Now you can execute Expectations natively within pandas DataFrames AND SQL AND Spark DataFrames.
Massive thanks to Christian Selig, who saw the need and took the initiative to make Spark integration happen. With the support of his coworkers at Cascade Data Labs, Christian shouldered the heavy lifting to bring Great Expectations to a major part of the data ecosystem.
But wait, there’s more…!
Working together over the last few months, we’ve learned that Christian is a true craftsman. Not content to simply re-implement the GE syntax in a new execution context, he also worked closely with James to make other major changes to GE’s inner workings.
Specifically, they’ve implemented a clean separation of concerns between engine-specific and engine-agnostic computation. This has several benefits.
- First, it allows code-reuse across Great Expectations’ execution engines.
- Second, it enables caching to avoid redundant queries (e.g., when different percentages share the same denominator). This feature is not enabled by default, but we plan to do more with it in the future. In the meantime, it’s there for power users who want to play.
- Third, it opens the door to further optimization.
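To make the caching idea concrete, here is a minimal sketch in plain Python. It is not the Great Expectations implementation: the names `ListEngine`, `CachedMetrics`, `fraction_null`, and `fraction_distinct` are hypothetical, and the "backend" is just a dict of lists standing in for a Spark or SQL query. The point is only that two percentages sharing a denominator can trigger a single row count.

```python
from typing import Dict, List, Optional, Tuple


class ListEngine:
    """Hypothetical engine-specific layer: computes raw counts for one backend.

    A Spark or SQL engine would issue a query here; this one scans a list.
    """

    def __init__(self, data: Dict[str, List[Optional[int]]]) -> None:
        self.data = data
        self.queries_run = 0  # how many times the backend was actually hit

    def compute(self, metric: str, column: str) -> int:
        self.queries_run += 1
        values = self.data[column]
        if metric == "row_count":
            return len(values)
        if metric == "null_count":
            return sum(v is None for v in values)
        if metric == "distinct_count":
            return len({v for v in values if v is not None})
        raise ValueError(f"unknown metric: {metric}")


class CachedMetrics:
    """Hypothetical engine-agnostic layer: caches counts, derives percentages."""

    def __init__(self, engine: ListEngine) -> None:
        self.engine = engine
        self._cache: Dict[Tuple[str, str], int] = {}

    def get(self, metric: str, column: str) -> int:
        key = (metric, column)
        if key not in self._cache:  # only ask the engine on a cache miss
            self._cache[key] = self.engine.compute(metric, column)
        return self._cache[key]

    def fraction_null(self, column: str) -> float:
        return self.get("null_count", column) / self.get("row_count", column)

    def fraction_distinct(self, column: str) -> float:
        return self.get("distinct_count", column) / self.get("row_count", column)


engine = ListEngine({"age": [31, None, 31, 45]})
metrics = CachedMetrics(engine)
null_fraction = metrics.fraction_null("age")
distinct_fraction = metrics.fraction_distinct("age")
```

Both fractions divide by the same row count, so the engine is asked for it once rather than twice. Swapping `ListEngine` for another backend would leave `CachedMetrics` untouched, which is the kind of code reuse the first bullet describes.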
From a developer experience standpoint, the immediate consequences are:
- Spark Expectations run considerably faster than they would have with redundant queries.
- Some Expectations also run considerably faster for pandas and SQLAlchemy.
- We’ve piggybacked on the Spark logic to enable distributional expectations within SqlAlchemyDataset. This has been another hot request from the community. We’re very happy to make it real.
- Adding new Expectations requires even less boilerplate, because Expectation argument validation logic is more centralized than in the past.
In the long term, this separation opens the door to substantial optimization in the GE execution layer itself. This is an area that we haven’t pushed aggressively yet. As we see Great Expectations being used in ever-bigger production systems, we’re starting to feel the need to introduce more scalability and optimization. (Dask, anyone?)
As a kicker, Michael Armbrust (Spark PMC member and original developer of Spark SQL) gave Great Expectations a shout-out from the last Spark AI Summit.
Nobody knows the Spark internals, ecosystem, and roadmap better than Michael. His tweet led to a fun conversation last week, in which he generously offered to review the Spark release and help us think through future possibilities for optimization. Spark’s optimization engine is phenomenal, so we have high hopes that this will lead to good things shortly.
However, this idea just came up a few days ago and Michael hasn’t actually done it yet. So he gets a heartfelt but restrained golf clap.
As with other releases, we have NOT implemented every single Expectation. After all, there are 47 of them now. (There are also a handful of straggling Expectations that still need to be implemented in SQL.)
We’ve decided to do it this way so that we can release faster. It also gives others a chance to participate in the project.
However, that chance is passing quickly. We’re working with partners and coding bootcamps to fill in all of these gaps during June and July. If you’d like a piece of the action before then, please jump in!
Other notes from this release:
- Some under-the-hood cleanup that brings all of the distributional expectations into the same cross-backend testing framework.
- We added vectorized implementations of several expectations in pandas, thanks to RoyalTS.
Look forward to more goodness and surprises from Great Expectations soon!
You should star us on GitHub.