Sitting back and reacting to complexity isn’t a solution. Instead, pipeline tests can improve an analyst’s ability to untangle sophisticated data logic when maintaining analytics code and dashboards built on top of it.
Defining pipeline tests and their impact
Pipeline tests are unit tests on data applied at batch runtime, not load or deploy time. When there are problems, action can be taken immediately to remediate them.
For software engineers, automated testing is just what you do to manage the rapidly growing complexity that emerges in any code base. In data pipelines, the concept of testing data is much newer and functions fundamentally differently (I’ll get to those differences later).
Consider a world in which the analyst is the one notifying the VP of Marketing that there may be an issue with a dashboard. If each upstream dataset had tests that prevented any downstream dependencies from running unless the tests passed, identifying the source of failure would be immediate.
In the broken dashboard case outlined earlier, already knowing the source of failure would save the analyst days of tracing dependencies in the pipeline. Extend these time savings to all pipelines, and the effort needed to maintain existing code is greatly reduced.
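To make the gating idea concrete, here is a minimal plain-Python sketch that is not tied to any orchestrator or to Great Expectations. The dataset names (`raw_signups`, `users`, `marketing_dashboard`) and the `run_pipeline` helper are hypothetical; the only point is that a failed test stops everything downstream and names the dataset at fault.

```python
from typing import Callable, Dict, List, Optional

Rows = List[dict]


def run_pipeline(
    steps: List[str],
    build: Dict[str, Callable[[Dict[str, Rows]], Rows]],
    tests: Dict[str, Callable[[Rows], bool]],
) -> Optional[str]:
    """Build datasets in dependency order and stop at the first failing test.

    Returns the name of the dataset whose test failed, or None if all passed.
    """
    built: Dict[str, Rows] = {}
    for name in steps:
        rows = build[name](built)
        if not tests[name](rows):
            return name  # nothing downstream of this dataset runs
        built[name] = rows
    return None


def unique_user_ids(rows: Rows) -> bool:
    ids = [row["user_id"] for row in rows]
    return len(ids) == len(set(ids))


# Hypothetical three-step pipeline; the raw data contains a duplicated user_id.
build = {
    "raw_signups": lambda built: [{"user_id": 1}, {"user_id": 1}],
    "users": lambda built: built["raw_signups"],
    "marketing_dashboard": lambda built: built["users"],
}
tests = {name: unique_user_ids for name in build}

failed_at = run_pipeline(["raw_signups", "users", "marketing_dashboard"], build, tests)
print(failed_at)  # raw_signups
```

Because the run stops at `raw_signups`, the dashboard is never refreshed with bad data and the analyst knows where to look first.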
Let’s revisit that lifecycle trajectory of a data pipeline.
[Figure: Lifecycle of a Data Pipeline with Tests]
The blue line illustrates the time it takes to triage an issue without tests, while the yellow line shows effort over time with the presence of pipeline tests.
In the yellow line scenario, the up-front development time is slightly greater. However, this pays off in the long run by reducing the technical debt that would otherwise accumulate. Spending more time early on building a more robust architecture increases the complexity carrying capacity of the team.
Simply put, pipeline tests reduce the amount of time needed for maintaining existing code. This allows the team to develop more new features to support the growth of the business.
Who wouldn’t want that?
Great Expectations: key concepts
Hopefully we’ve convinced you it’s worth spending extra time writing tests. You’re probably wondering: where does Great Expectations come in?
First let’s introduce a few key concepts:
- Expectations & Expectation Suites: A single Expectation is the way certain data should appear. It’s truly what you expect of your data. An Expectation Suite is a set of Expectations, a grouping which defines characteristics of an entire data asset.
- Validations: A Validation is a process in which a set of data is compared to its associated Expectation Suite to determine if the batch of data meets expectations.
- Checkpoints: A Checkpoint defines a batch dataset to fetch, a set of validations to run on the data, and any associated actions such as updating documentation. This is where it all comes together.
Expectations are defined when building a new pipeline or data model, along with Checkpoints that specify which Expectation Suites run against which data assets. After the code has been deployed, a Validation should be run with every pipeline execution. This flow ensures that every time data is updated, it is also validated.
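One way to see how these concepts relate is a toy model in plain Python. This is not the Great Expectations API: the class names simply mirror the concepts, and the `users` asset name and `has_rows` check are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rows = List[dict]


@dataclass
class Expectation:
    name: str
    check: Callable[[Rows], bool]  # True when the data looks as expected


@dataclass
class ExpectationSuite:
    asset: str
    expectations: List[Expectation] = field(default_factory=list)


def validate(batch: Rows, suite: ExpectationSuite) -> Dict[str, bool]:
    """A Validation: compare one batch of data to its Expectation Suite."""
    return {e.name: e.check(batch) for e in suite.expectations}


@dataclass
class Checkpoint:
    fetch_batch: Callable[[], Rows]  # which data to fetch
    suite: ExpectationSuite          # what to validate it against
    actions: List[Callable[[Dict[str, bool]], None]] = field(default_factory=list)

    def run(self) -> bool:
        results = validate(self.fetch_batch(), self.suite)
        for action in self.actions:  # e.g. update docs, send a notification
            action(results)
        return all(results.values())


suite = ExpectationSuite("users", [Expectation("has_rows", lambda rows: len(rows) > 0)])
log: List[Dict[str, bool]] = []
checkpoint = Checkpoint(fetch_batch=lambda: [{"user_id": 1}], suite=suite, actions=[log.append])
passed = checkpoint.run()
```

The Suite is written once at build time, while `checkpoint.run()` is what would be called on every pipeline execution.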
The key difference between software unit tests and pipeline tests is that unit tests typically run within development or at deploy time, while pipeline tests run on every execution within the data pipeline itself.
However, this only works if the organization’s expectation (no pun intended) is already set that the time to build a pipeline includes the time to build its tests. Otherwise, this step is easily pushed to the backlog.
Once a pipeline is working and tests are no longer top of mind, coming back to add them is hard to prioritize. By the time something breaks, pipeline debt has already accumulated, which is exactly what tests help avoid.
Concepts are great, but what does the actual architecture look like? Let’s get to it.
Implementation: a guide
Great Expectations is most commonly used from Python: install the pip package, then use Jupyter Notebooks or Python scripts to build Expectations and run Validations.
Consider a table in the data warehouse. It might make sense for this table to be unique on user_id, where each row represents one user of a particular product in a company. There could also be columns with user properties such as email and other information about how they behave with the particular product.
When building an Expectation Suite or running Validations, the data must first be fetched in a batch from a particular data source before adding the Expectations themselves (see the Great Expectations documentation on batches).
With the context configured and the batch of data instantiated, the actual Expectations can be defined based on the business context surrounding the dataset. For instance, below is an example of an Expectation ensuring the table is unique on user_id:

    batch.expect_column_proportion_of_unique_values_to_be_between(
        column="user_id",
        min_value=1.0
    )
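Why does a minimum proportion of 1.0 amount to uniqueness? The proportion of unique values is the number of distinct values divided by the total number of values, so it only reaches 1.0 when nothing repeats. Here is that arithmetic in plain Python; the `proportion_unique` helper is hypothetical, not part of Great Expectations, and it ignores how the library treats nulls.

```python
def proportion_unique(values):
    """Distinct values divided by total values; 1.0 means no repeats."""
    if not values:
        raise ValueError("cannot compute a proportion on an empty column")
    return len(set(values)) / len(values)


no_repeats = proportion_unique([101, 102, 103, 104])
one_repeat = proportion_unique([101, 102, 102, 104])

print(no_repeats)  # 1.0
print(one_repeat)  # 0.75
```

A single duplicated user_id drops the proportion below 1.0, so the Expectation above would fail.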
Expectations can fall into two categories: architectural integrity of the data, and business context; both are equally important. It can be hard, with each Expectation Suite, to think about what to test from scratch each time. Instead, a framework helps to make sure test suites are consistent and thorough.
I suggest always starting with the following patterns:
- Expect uniqueness on a particular column, or set of columns.
- For numeric columns, expect the minimum, maximum, or mean values to fall within a range. The range itself is often deduced from business context and may vary column to column.
- For character columns with a finite and somewhat short list of possible values, expect the column to only contain those values.
- For all columns, expect a certain percentage of rows to be non-null. If a column is always null, it is likely wrong or not needed.
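As a rough illustration of what each of these four patterns checks, here they are written as plain-Python functions over a tiny in-memory table. None of this is the Great Expectations API; the `age` column, the range of 13 to 120, and the 50% non-null threshold are assumptions made up for the example.

```python
rows = [
    {"user_id": 1, "age": 34, "signup_source": "Organic", "email": "a@example.com"},
    {"user_id": 2, "age": 27, "signup_source": "Referral", "email": None},
    {"user_id": 3, "age": 45, "signup_source": "Paid Ads", "email": "c@example.com"},
]


def column(rows, name):
    return [row[name] for row in rows]


def is_unique(values):
    return len(values) == len(set(values))


def within_range(values, low, high):
    return all(low <= value <= high for value in values)


def in_allowed_set(values, allowed):
    return set(values) <= set(allowed)


def non_null_fraction(values):
    return sum(value is not None for value in values) / len(values)


results = {
    "user_id is unique": is_unique(column(rows, "user_id")),
    "age is between 13 and 120": within_range(column(rows, "age"), 13, 120),
    "signup_source is in the allowed set": in_allowed_set(
        column(rows, "signup_source"), {"Paid Ads", "Referral", "Organic"}
    ),
    "email is at least 50% non-null": non_null_fraction(column(rows, "email")) >= 0.5,
}

for name, passed in results.items():
    print(f"{name}: {passed}")
```

Starting every new suite from a checklist like this keeps suites consistent across tables, with the thresholds coming from business context.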
Of course, this is not an exhaustive list of Expectations. For instance, comparisons across columns could ensure certain conditions expected across the business. A full glossary of the types of Expectations already implemented within the package is available in the Great Expectations documentation.
Tying it back to our initial analyst working on marketing dashboards, an Expectation on the signup_source column might be particularly useful for the marketing team. Consider the following, making sure the signup source is only from Paid Ads, Referral, or Organic sources:

    batch.expect_column_values_to_be_in_set(
        column="signup_source",
        value_set=["Paid Ads", "Referral", "Organic"]
    )
If users suddenly stopped coming from paid ads and instead came from a completely different source not in the list, this might cause unexpected behavior in a dashboard that analyzes all signups from paid ads. If the test on signup_source were to fail, the analyst would have an initial point of failure to start investigating before the data gets to the dashboard.
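What makes a failed test a useful starting point is that it can report which values were unexpected and how often they occurred. A small plain-Python sketch of that idea follows; the `Partner` value and the `unexpected_values` helper are invented for illustration, and this is not how Great Expectations formats its own results.

```python
from collections import Counter

ALLOWED_SOURCES = {"Paid Ads", "Referral", "Organic"}


def unexpected_values(values, allowed):
    """Count every value that is not in the allowed set."""
    return Counter(value for value in values if value not in allowed)


# A hypothetical batch in which a new, unlisted source has appeared.
signup_sources = ["Organic", "Referral", "Partner", "Partner", "Organic"]

unexpected = unexpected_values(signup_sources, ALLOWED_SOURCES)
test_passed = not unexpected

print(test_passed)       # False
print(dict(unexpected))  # {'Partner': 2}
```

With this in hand, the analyst’s first question is no longer “what broke?” but “where did Partner signups come from?”.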
Once an Expectation Suite has been defined, how is it actually used?
The answer: Validations. Validations are run as part of a Checkpoint, which defines the Expectation Suites to validate. Creating a Checkpoint is covered in the Great Expectations documentation.
At pipeline run time, a Checkpoint is triggered after the new data is generated but before it is saved to production data assets and pushed into BI. A Checkpoint can be triggered via a Python function, or integrated directly in your workflow orchestrator of choice via the new Great Expectations Airflow Operator or with a direct Dagster integration.
As an example, say there is an Airflow DAG that updates several tables, including the table of users described above. Each table would have one task that generates the data into a temporary table, one that runs the tests, and one that saves the data from the temporary table only if the tests pass.
With this type of task dependency, the table is ensured to meet the business Expectations identified when implementing it. If they aren’t met, the pipeline fails and a notification can be sent to the team for further investigation.
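The generate, test, then publish dependency can be sketched without Airflow, using Python’s built-in sqlite3 module as a stand-in warehouse. The `users` and `users_tmp` table names and the single uniqueness test are assumptions for this example; in a real DAG each step would be its own task, and the test step would run a Checkpoint instead of a hand-written query.

```python
import sqlite3


def refresh_users(conn, new_rows):
    """Load new data into a temporary table, test it, publish only on success."""
    # Step 1: generate the data into a temporary table.
    conn.execute("DROP TABLE IF EXISTS users_tmp")
    conn.execute("CREATE TABLE users_tmp (user_id INTEGER, signup_source TEXT)")
    conn.executemany("INSERT INTO users_tmp VALUES (?, ?)", new_rows)

    # Step 2: run the tests against the temporary table.
    total, distinct = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT user_id) FROM users_tmp"
    ).fetchone()
    if total != distinct:
        raise ValueError("users_tmp is not unique on user_id; users left untouched")

    # Step 3: only now replace the production table's contents.
    conn.execute("DELETE FROM users")
    conn.execute("INSERT INTO users SELECT user_id, signup_source FROM users_tmp")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, signup_source TEXT)")

# A clean batch is published.
refresh_users(conn, [(1, "Organic"), (2, "Referral")])

# A batch with a duplicated user_id fails the test and never reaches production.
try:
    refresh_users(conn, [(1, "Organic"), (1, "Paid Ads")])
    failure = None
except ValueError as exc:
    failure = str(exc)

published = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(published)  # 2
```

The raised error plays the role of the failed task: it is where a notification to the team would be hooked in.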
How it all ties together
Any project, whether data or software engineering, gains complexity over time. Data changes, bugs come up, edge cases occur. Eventually projects become hard to manage, because finding the root cause of an issue gets harder as the pipeline grows.
However, the maintenance a pipeline requires can be reduced with unit tests on its data. Putting these guard rails in place helps increase the complexity carrying capacity of a team, resulting in higher efficiency.
Great Expectations provides a framework for testing data whether it be within a standard data pipeline or a complicated machine learning workflow. Unit tests solve the issue many data teams face: pipeline debt quickly outgrowing the team’s capacity.
This post is inspired by “Strata 2018: Pipeline Testing with Great Expectations” and “Testing and Documenting Your Data Doesn’t Have to Suck”.
By Sarah Krasnik
Thanks for reading! I love talking data stacks. Shoot me a message.