
Not all SQL is created equal

How Great Expectations guarantees cross-infrastructure data consistency

Great Expectations
September 03, 2020

Written by Abe Gong

We recently released full support for Microsoft SQL Server (MSSQL) in Great Expectations. This is a big step forward for the community, especially for teams that work within the Azure ecosystem and make heavy use of Microsoft's data tools.

By strict accounting, MSSQL is the 9th backend with full support for native validation of Expectations. (The others are Pandas, Spark, PostgreSQL, BigQuery, MySQL, Redshift, Snowflake, and SQLite.) Beyond that, several other dialects (e.g. Presto, Hive, Athena, Dask) are similar enough that we've seen teams make good use of Great Expectations, even without full, formal support.

This post is a behind-the-scenes look at what it takes to get to full support for a backend in Great Expectations.

The first section starts from first principles: why cross-infrastructure consistency is both important and difficult for data engineering teams. This section should be useful to anyone interested in Great Expectations' cross-platform capabilities.

The rest of the article unpacks the work required to guarantee cross-infrastructure consistency. These sections should be useful for skeptics ("How do you know that Great Expectations is consistent across infrastructure?") and contributors ("How can I help extend Great Expectations to work with more infrastructure?").

Side note: The SQL Server integration was made possible by sponsorship from and collaboration with BERA Brand Management, a cutting-edge market research company. Huge thanks to Tom Davidson and the BERA team for supporting open source and being great to work with. If you'd like to co-develop something similar, we are totally up for that.

Expectations are consistent across infrastructure

One of the core promises of Great Expectations is that "Expectations are consistent across infrastructure."

Let's say you have a dataset, and you

  1. First explore it as a CSV on your laptop,
  2. Then store it as a parquet file to S3,
  3. Then ELT it into Snowflake for querying,
  4. Then load it into a Spark DataFrame to use as part of an ML model.

At each stage of this journey, the bits representing your dataset are arranged in totally different ways, in totally different locations, on totally different infrastructure--but you still think of it as the same dataset. And because it's the same dataset at each step, you expect it to look and behave the same.

However, ensuring that your cross-infrastructure data flows meet those expectations creates some thorny engineering problems.

As you test and document your data flows, you don't want to write validation code in a new dialect at each step: a Python function here, a SQL snippet there, a different SQL snippet over there. That kind of polyglot testing is time-consuming, frustrating to maintain, and fails to capture the essential same- or different-ness of the data flowing through your system.

Instead, you should be able to express your expectations in a common language that follows your data through every step of your infrastructure. You should be able to trust that validating the same expectations against the same data will return the same results, even if the validation happens against different backends.
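For example, using the pandas interface, step 1 of that journey might look like the following minimal sketch (the CSV path here is hypothetical):

import great_expectations as ge

# Load a local CSV through the pandas backend and evaluate one expectation.
df = ge.read_csv("my_dataset.csv")  # hypothetical path
result = df.expect_column_values_to_be_in_set("x", [1, 2, 4])
print(result.success)

The same expectation, once captured in an Expectation Suite, can later be validated against the Snowflake table or the Spark DataFrame without being rewritten.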

Very few data teams do all of their work within a single type of infrastructure. Instead, flexibility to move data across infrastructure unlocks the ability to choose the right tool for the job. You shouldn't have to choose between flexibility and testability--but that's exactly the dilemma that many data teams face today.

This is the problem that we are solving with Great Expectations by making Expectations consistent across infrastructure. Great Expectations is the only widely adopted modern data framework where these guarantees are baked directly into the core abstractions of the product.

Testing the tests

You can see this in our unit tests for the Great Expectations code base. The unit tests that evaluate the core logic for each Expectation aren't written independently for each backend. Instead, we generate tests from a set of curated test fixtures, plus configuration logic for loading them into various backends.

For example, here's the v0.11.9 test fixture for expect_column_values_to_be_in_set:

{
  "expectation_type": "expect_column_values_to_be_in_set",
  "datasets": [{
    "data": {
      "x": [1, 2, 4],
      "y": [1.1, 2.2, 5.5],
      "z": ["hello", "jello", "mello"]
    },
    "tests": [...]
  }]
}

The fixture declares a simple dataset to support testing, alongside a tests array. This array contains named test cases specifying inputs and outputs, such as:

{
  "title": "Basic positive test case, number set",
  "exact_match_out": false,
  "in": {
    "column": "x",
    "value_set": [1, 2, 4]
  },
  "out": {
    "success": true
  }
}

When the Great Expectations unit tests are executed, the testing machinery loads the dataset into a variety of backends, executes each test case, and verifies that each one executes correctly.
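Conceptually, that machinery looks something like the sketch below. The names are illustrative (load_dataset is a hypothetical helper), and the real harness in the Great Expectations repository is more involved:

import json
import pytest

BACKENDS = ["pandas", "spark", "sqlite", "postgresql", "mssql"]

@pytest.mark.parametrize("backend", BACKENDS)
def test_expectation_fixture(backend):
    # One curated fixture fans out into one test per (backend, case) pair.
    with open("expect_column_values_to_be_in_set.json") as f:
        fixture = json.load(f)
    for dataset_spec in fixture["datasets"]:
        dataset = load_dataset(dataset_spec, backend)  # hypothetical loader
        for case in dataset_spec["tests"]:
            if backend in case.get("suppress_test_for", []):
                continue  # some cases don't apply to some backends
            result = getattr(dataset, fixture["expectation_type"])(**case["in"])
            assert result.success == case["out"]["success"]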

Because full validation of proper execution is so important, we never use database mocks for unit testing. Instead, we use full instantiations of each backend, often using Docker images.
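To make that concrete: a backend like MSSQL can be stood up as a disposable Docker container (e.g. Microsoft's mcr.microsoft.com/mssql/server image) and reached through SQLAlchemy. A minimal connectivity sketch, with illustrative credentials, port, driver string, and database name:

import sqlalchemy as sa

# Connect to a throwaway SQL Server container started via Docker;
# all connection details here are illustrative.
engine = sa.create_engine(
    "mssql+pyodbc://sa:Password_123@localhost:1433/test_ci"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)
with engine.connect() as conn:
    print(conn.execute(sa.text("SELECT @@VERSION")).scalar())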

In some cases, specific backends need to have a schema specified to ensure proper loading. For example, the test fixture for expect_column_value_lengths_to_be_between specifies a schema for loading a Spark DataFrame:

{
  "expectation_type": "expect_column_value_lengths_to_be_between",
  "datasets": [{
    "data": {
      "s1": ["smart", "silly", "sassy", "slimy", "sexy"],
      "s2": ["cool", "calm", "collected", "casual", "creepy"],
      "s3": ["cool", "calm", "collected", "casual", null],
      "s4": [1, 2, 3, 4, 5]
    },
    "schemas": {
      "spark": {
        "s1": "StringType",
        "s2": "StringType",
        "s3": "StringType",
        "s4": "IntegerType"
      }
    },
    "tests": [...]
  }]
}
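Here's a sketch of how such a declared schema could be materialized when loading fixture data into Spark. The type mapping and loading logic below are illustrative, not the exact Great Expectations internals:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Map the fixture's type names onto concrete Spark types.
SPARK_TYPES = {"StringType": StringType(), "IntegerType": IntegerType()}

fixture_schema = {"s1": "StringType", "s2": "StringType",
                  "s3": "StringType", "s4": "IntegerType"}
schema = StructType([
    StructField(name, SPARK_TYPES[type_name], nullable=True)
    for name, type_name in fixture_schema.items()
])

spark = SparkSession.builder.master("local[1]").getOrCreate()
data = {"s1": ["smart", "silly"], "s2": ["cool", "calm"],
        "s3": ["cool", None], "s4": [1, 2]}
rows = list(zip(*data.values()))  # column-major fixture data to row tuples
df = spark.createDataFrame(rows, schema=schema)
df.printSchema()  # s3 is a nullable StringType; s4 is IntegerType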

Test cases that can't be executed against a given backend are suppressed on a case-by-case basis. For example, test cases that mix types within a given column are often essential for pandas, but nonsensical in the more strictly typed world of SQL.

Here's one such suppressed test case:

{
  "title": "Test on a series containing dates",
  "exact_match_out": false,
  "suppress_test_for": ["sqlalchemy"],
  "in": {
    "column": "zz",
    "min_value": "2/1/2016",
    "max_value": "3/1/2016",
    "parse_strings_as_datetimes": true
  },
  "out": {
    "success": true,
    "observed_value": "2016-02-01 00:00:00"
  }
}

We built this unit testing approach into Great Expectations very early on and it has paid dividends ever since. For contributors to the project, it makes the behavior of each Expectation very clear. For users of Great Expectations, it is an excellent way to guarantee that validating the same expectations against the same data will return the same results, even if the validation happens against different backends.

SQL != SQL

Of course, besides the tests, Great Expectations also includes the business logic for executing Expectations. Typically, that means at least three implementations per Expectation, to support native execution within Pandas, Spark, and SQLAlchemy.
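As a rough illustration, here's what three native implementations of a single check (a simplified not-null condition) can look like. These are stand-ins for the pattern, not the actual Great Expectations internals:

import pandas as pd
import sqlalchemy as sa
from pyspark.sql import functions as F

# Pandas: vectorized evaluation in memory.
def pandas_values_not_null(series: pd.Series) -> bool:
    return bool(series.notnull().all())

# Spark: push the predicate into the DataFrame engine.
def spark_values_not_null(df, column: str) -> bool:
    return df.filter(F.col(column).isNull()).count() == 0

# SQLAlchemy: compile the predicate to the backend's SQL dialect.
def sqlalchemy_values_not_null(engine, table: str, column: str) -> bool:
    query = (
        sa.select([sa.func.count()])
        .select_from(sa.table(table))
        .where(sa.column(column).is_(None))
    )
    with engine.connect() as conn:
        return conn.execute(query).scalar() == 0

The shared fixtures described above are what pin all three implementations to the same observable behavior.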

And within SQL itself, there's enough variation in dialects that SQLAlchemy is not a one-size-fits-all solution. For example, in MSSQL specifically:

  1. Unlike most SQL dialects, SQL Server does not support a regex operation, only a LIKE operation. When we implemented the SQL Server integration, we briefly considered shoehorning this behavior into the existing expect_column_values_to_match_regex Expectation, then realized that it would be impossible to guarantee consistent execution across all platforms. So we created a new expect_column_values_to_match_like_operator Expectation instead (see the sketch after this list).

  2. INFINITY is not represented consistently in all DBMS engines. Some provide a predicate: Expression IS [NOT] INFINITE. Others support the IS_INF(X) operation. Still others support Inf and -Inf as special values. SQL Server's support for the concept of INFINITY is very thin (MySQL is similar), so we had to create a get_sql_dialect_floating_point_infinity_value method with mappings from different databases to different values representing positive and negative infinity.

  3. SQL Server doesn't have a BOOLEAN type. Instead, it has BIT, which has slightly different semantics for evaluating truthiness. This led in turn to several new schema definitions and test cases in our unit tests.
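Here's a small sketch of the first two points: SQLAlchemy can compile a LIKE predicate for every dialect (which regex matching cannot guarantee), while infinity has to be approximated per dialect. The mapping values below are illustrative, not the exact ones Great Expectations uses:

import sqlalchemy as sa

# 1. A LIKE predicate compiles on every SQLAlchemy dialect, including
#    SQL Server, where a native regex operator is unavailable.
like_condition = sa.column("z").like("%ello")
print(like_condition)  # z LIKE :z_1

# 2. Per-dialect stand-ins for floating-point infinity, in the spirit of
#    get_sql_dialect_floating_point_infinity_value (values illustrative).
POSITIVE_INFINITY = {
    "postgresql": float("inf"),  # supports 'Infinity' natively
    "mssql": 1.79e308,           # approximated near FLOAT's maximum
    "mysql": 1.79e308,
}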

These differences in dialect are a long-tailed source of conundrums for maintaining consistency. Developing Expectation APIs that are intuitive, expressive and consistent is a large part of the craft of Great Expectations.

Where next?

MSSQL is the latest dialect to gain full support within Great Expectations. There will be others. We will continue to listen to the community and develop further integrations as demand increases. We're also taking steps to speed up development for future integrations:

First, we've added a new How-To guide: "How to add support for a new SQLAlchemy dialect" to streamline the process for adding SQL dialects to Great Expectations. (Thanks to Dan at Earnest Research for excellent feedback!) We'll follow these steps internally when we integrate with new dialects. It'll also help contributors who want to develop integrations themselves. (If you're in this camp, please speak up in the #contributors channel and we'll be happy to assist.)

Second, we're excited for more paid partnerships to accelerate development of specific integrations within Great Expectations. If you'd like to sponsor the work to build a specific SQL dialect (or other integration), please check out our consulting services at superconductive.com. Once again, huge thanks to the team at BERA for their support and collaboration on the MSSQL integration.

Finally, we're also doing some cool stuff with cross-platform integration testing behind the scenes, but that's a story for another time. If you want to stay informed on future news and events involving Great Expectations, let us know here.

The header image "Learning SQL" by therefromhere is licensed under CC BY-NC-SA 2.0.
