Where to use data quality tests with relational data sources

Don't worry about what counts as a production system: test data every time it's moved or transformed.

Sarah Krasnik

January 11, 2023

Never miss a blog

A generic grid with interconnections and one of the interconnections having other connections within/around it

Test your data every time it moves or is transformed, within and between sources. (📸: Growtika via Unsplash)

When should you do quality testing on your relational data, and what data sources should you target?

When software engineers think of relational data, most think about traditional transaction-oriented databases like PostgreSQL, MySQL, or MSSQL. For analytics teams, interactions with relational data happens mostly in analytics- and metrics-oriented databases or warehouses like Snowflake, Databricks, or BigQuery.

Either type of system can be considered production, depending on the context. How do you decide which one to target for quality testing?

For best results: you don’t. Data quality is most effective at catching issues before they can impact business operations if there’s quality checking every time that data moves between systems.

Real production pipelines are complex. But they can be characterized in terms of three major data interfaces, which are applicable to a wide variety of situations. The most typical is:

Product application to analytics
Enrichment and transformation of raw data
Analytics to business (reporting or third-party application)

Let’s dive into each situation in reverse order, starting with the most end-user-facing.

QA analytics output before business teams see it

The last stop for transactional data is usually in the form of some dashboard. By the time data reaches a dashboard, it represents an event or entity which the business is fundamentally interested in: a sale, a customer, a product, and so on.

Since dashboards are an easy way for cross-functional teams to understand the behavior of these entities, they’re often used to inform decision-making. Even if there are data quality checks implemented elsewhere, testing data right before it reaches a report is the last stop to ensure bad data doesn’t make its way into the hands of decision-makers. Making decisions on bad data is worse than making decisions without data at all.

Testing at this stage should try to mimic the domain knowledge of the dashboard consumer. For example, write reasonableness tests that check whether data that’s expected to be consistent over time actually is. This could be accomplished by using range tests for numbers or valid values tests for string fields.

Validate transformations in the warehouse

This location is easy to overlook because many architecture diagrams elide the complexity of enrichment activities like joining together raw data, aggregating, and rejoining—which is reasonable in principle, if you want a readable diagram. But if that simplification causes your data quality planning to overlook what’s really happening, it can seriously hamper the effectiveness of your data quality.

In practice, data is constantly enriched in systems like Snowflake, Databricks, or BigQuery. Any time data is transformed, it should be tested. No matter how small the transformation may seem, it can always grow in complexity over time with new additions, be simple enough to still make a mistake, or be misunderstood.

If your data testing implementation isn’t granular enough, root cause analysis becomes infinitely harder. A transformation with 50 different steps yielding incorrect results has many different potential failure points (far more than 50!) that all have to be looked into.

Even before a failure happens, as an analytics team grows, a growing number of people will be using each dataset. They should all be under the same understanding of what values are acceptable and which ones aren’t.

We haven’t even talked about warehouse-native applications: if using operational tools that allow operational teams to use warehouse data directly, any untested broken data will be exposed much faster and have that much more operational impact.

A thorough testing framework should have data quality tests on every data asset: specifically, on every table and view. To ensure data is tested before the asset is updated, store the data in a staging table, run tests, and only if tests pass move the data into the production table.

Test product application data before transforming data

In an ideal world, application systems would always produce the data that the analytics systems are expecting, because engineering, product, and analytics teams would discuss each change during the product requirements phase.

In reality, companies often move fast and ship without giving much time for process, and data teams may not own the application data—and worst case they’re not even notified, let alone consulted, about changes. But even in ideal circumstances, relying on process still leaves plenty of room for bad data through unintentional errors.

It’s best to test data quality in the originating application itself, but that isn’t a replacement for testing data in the warehouse before transformation.

Connect directly to a PostgreSQL, MySQL, or MSSQL database to run tests that include both volume and business logic checks. The goal here is to make sure data copied doesn’t break assumptions made by the analytics team using the data downstream.

GX for quality testing relational data

However your data pipeline is structured, if you want truly high-quality data you need full test coverage: tests that check your data every time it’s moved or transformed. Without complete test coverage, quality errors can manifest in between tests. The farther the issue is from the test, the more likely it is to go undetected, and the more difficult it is to troubleshoot if you do find it.

GX has a wide range of integrations and a pre-existing library of 300+ tests, called Expectations, setting you up for fast and effective data quality deployment.

Expectations can be as granular as you need, making GX suitable for virtually any use case. You can easily use GX to check for relatively simple issues like too many nulls and statistical values outside of the expected range, as well as for more complex tests that take advantage of your knowledge of your application and analytics environment. You can even create your own tests to look for unique-to-you quality issues.

Get started testing your relational data sources today: download Great Expectations and get started with our tutorial, or learn more about GX OSS.