Announcing Great Expectations v0.4 (We have SQL...!)
Great Expectations can now execute in native SQL using the SQLAlchemy API. Thanks to everyone who made this (and a bunch of other cool additions) happen!
March 18, 2018
It’s been an awesome first month for Great Expectations! We shared the framework at Strata, on Medium, and on Twitter. The reaction was immediate and very positive.
Thank you! We’re glad you like it, and eager to keep building!
Based on feedback from the past month, we’ve revised, improved, and extended Great Expectations. 284 commits, 103 files changed, and 7 new contributors later, we’ve just released v0.4!
Here’s what’s new.
#1 Native SQL
By far the most common request we received was the ability to run expectations natively in SQL. This was always on the roadmap. The community response made it our top priority.
We’ve introduced a new class called SQLAlchemyDataset. It contains all* the same expectations as the original PandasDataset class, but instead of executing them against a DataFrame in local memory, it executes them against a database table using the SQLAlchemy core API.
This gets us several wins, all at once:
Since SQLAlchemy binds to most popular databases, we get immediate integration with all of those systems. We’ve already heard from teams developing against PostgreSQL, Presto/Hive, and SQL Server. We expect to see lots more adoption on this front soon.
Since the SQLAlchemy API is consistent across databases, we can maintain compatibility with many databases with a minimum of new code in Great Expectations. (Note: it’s quite possible that we will eventually have to include some non-standard code for specific databases. In that case, we can subclass SQLAlchemyDataset to keep the code footprint to a minimum.)
This approach takes the compute to the data. For pipeline testing to work in practice, expectations must be able to execute natively within whatever data processing systems people are working with already. Almost everybody uses SQL somewhere in their stack. Now Great Expectations can live there, too. Practically speaking, this means that teams that manage most of their pipelines in SQL can apply pipeline testing using the same expectation syntax that the Pandas version uses, without copying tables out of the database all the time.
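To make "taking the compute to the data" concrete, here is a minimal sketch of a null check evaluated inside the database, so that only two aggregate numbers travel back to Python. It is not how SQLAlchemyDataset is implemented: it uses the standard library's sqlite3 module in place of SQLAlchemy so that it runs anywhere, and the function name `null_check_in_sql`, the `users` table, and the result keys are all assumptions made for illustration.

```python
import sqlite3


def null_check_in_sql(conn, table, column):
    """Hypothetical helper: count NULLs in a column without fetching rows.

    The database does the aggregation; Python only sees two integers.
    Identifiers cannot be bound as query parameters, so they are quoted
    here. Only pass trusted table and column names.
    """
    query = (
        f'SELECT COUNT(*), '
        f'SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END) '
        f'FROM "{table}"'
    )
    element_count, unexpected_count = conn.execute(query).fetchone()
    unexpected_count = unexpected_count or 0  # SUM is NULL on an empty table
    return {
        "success": unexpected_count == 0,
        "element_count": element_count,
        "unexpected_count": unexpected_count,
    }


# Build a small in-memory table to run the check against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com")],
)

result = null_check_in_sql(conn, "users", "email")
print(result)
```

The same idea scales to large tables: the query cost stays in the database, and the amount of data copied out does not grow with the row count.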
Caveat: A moment ago, we said that SQLAlchemyDataset “contains all* the same expectations as the original PandasDataset class.” That’s technically true. However, they’re not all implemented yet. (See the release notes for the full list.)
We hope that some of you will find it in your hearts to help finish these NotYetImplemented expectations. (Because of the magic of decorators like @column_map_expectation, implementing a new expectation is often just a couple of lines of code.) If not, the core team will continue to chip away at them.
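As a rough illustration of why a decorator can shrink an expectation to a couple of lines, the sketch below lifts an element-wise predicate into a column-level check. This is a toy, not the real @column_map_expectation: the decorator name `toy_column_map_expectation`, the null handling, and the shape of the returned dictionary are assumptions chosen to keep the example self-contained.

```python
from functools import wraps


def toy_column_map_expectation(predicate):
    """Toy decorator: turn a per-value predicate into a per-column check.

    The wrapper owns the shared bookkeeping (skipping nulls, collecting
    failures, building the result), so each new expectation only has to
    say what makes a single value acceptable.
    """

    @wraps(predicate)
    def wrapper(values, *args, **kwargs):
        non_null = [v for v in values if v is not None]
        unexpected = [v for v in non_null if not predicate(v, *args, **kwargs)]
        return {"success": not unexpected, "unexpected_list": unexpected}

    return wrapper


@toy_column_map_expectation
def expect_values_to_be_between(value, min_value, max_value):
    # The entire "new expectation" is this one comparison.
    return min_value <= value <= max_value


result = expect_values_to_be_between([1, 5, None, 12], min_value=0, max_value=10)
print(result)
```

Because the bookkeeping lives in one place, a fix to null handling or result formatting applies to every decorated expectation at once.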
#2 A cleaner expectations result API
In Great Expectations v0.3.*, there were several subtle but pervasive inconsistencies in the result_objs that are returned from expectations. The biggest complaints that we heard from power users of Great Expectations revolved around these inconsistencies. (You can find details in Issue 175 and the release notes.)
These are fixed in v0.4. This API cleanup puts the project on much firmer footing for future releases. For teams that have been using Great Expectations extensively: thanks for surfacing points of confusion and helping us resolve them. For teams that are just starting to use Great Expectations: trust us, we’ve just saved you a bunch of headaches down the road.
That said, we know that introducing a change of this kind will cause headaches of its own: it will break downstream code that consumes expectation results. To fix that code, you shouldn’t need to do anything more complicated than unpack JSON objects differently. And you can always pin to version v0.3.2 as a temporary fix.
We promise we won’t do this too often. Please get in touch via Issue #247 if this migration gives you any trouble.
Other notable changes
Thanks, @dlwhite5, for diving deep into the pandas internals so that operations on a PandasDataset now return another PandasDataset (instead of a regular pandas.DataFrame).
Thanks to @ccnobbli for implementing expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than! This expectation allows you to compare a column against parameterized continuous distributions implemented in scipy (e.g. normal, Poisson, beta, etc.).
Thanks, @schrockn, for suggesting and implementing ge.from_pandas(), to make pandas-to-great_expectations conversion more discoverable and user-friendly. We also implemented a top-level validate option, ge.validate(), for the same reason.
Thanks, @louispotok, for adding a column_index parameter to expect_column_to_exist, so that users can test column orderings.
Thanks to @rjurney for suggesting a ge.read_json() helper function to read files that contain JSON lines.
We also made some deep, behind-the-scenes improvements to the Great Expectations testing framework to ensure parity across data contexts. This is a big enough deal that it will probably get its own blog post soon.
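For readers who haven’t met the JSON lines format mentioned in the list above: it stores one JSON object per line, rather than one JSON document per file. The sketch below parses that format with the standard library only. It is not the implementation of ge.read_json(); the helper name `read_json_lines` and the sample records are assumptions made for illustration.

```python
import io
import json


def read_json_lines(handle):
    """Hypothetical helper: parse a file-like object with one JSON object per line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # tolerate blank lines between records
            records.append(json.loads(line))
    return records


# An in-memory stand-in for a file on disk.
sample = io.StringIO('{"id": 1, "name": "a"}\n\n{"id": 2, "name": "b"}\n')
rows = read_json_lines(sample)
print(rows)
```

A list of records like this is easy to hand to a DataFrame constructor, which is what makes the format convenient for tabular data.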
Full release notes are here.
Onward and upward
Thanks again to everyone who contributed feedback and code to this release. Please keep it coming!
We’re excited to make Great Expectations more useful. Together, we will obliterate pipeline debt, once and for all.
This post was originally published on Medium.