Continuous Integration for your data with GitHub Actions and Great Expectations
One step closer to CI/CD for your data pipelines
October 01, 2020
If you are reading this before Oct 8th, you can join our Community Show and Tell, where we will demo this outstanding integration for the first time. Sign up here
You might have noticed that we’ve been busy over the past few weeks, working on some really amazing collaborations with fellow tech and data folks in the Great Expectations community (like the Dagster integrations and our Komodo Health and Calm case studies). This project has been brewing for a while, and we’re absolutely over the moon (yes!) to announce that we’ve just published a GitHub Action for Great Expectations, aka “CI/CD for data”, now live on GitHub. This means that you can now run data validation as part of your continuous integration (CI) workflows to safeguard your data pipelines and keep pipeline bugs from reaching production. Read on to learn more about what we built and how you can use the integration, or go straight to the repo and check out all the info in the README!
What are GitHub Actions?
GitHub Actions is a feature of GitHub that helps you automate your software development workflows in the same place you store code and collaborate on pull requests and issues. You can write individual tasks, called actions, and combine them to create a custom workflow. Workflows are custom automated processes that you can set up in your repository to build, test, package, release, or deploy any code project on GitHub. With GitHub Actions you can build end-to-end continuous integration (CI) and continuous deployment (CD) capabilities directly in your repository.
How do GitHub Actions integrate with Great Expectations?
Over the past couple of months, our team (in particular GE engineer Taylor Miller) has been working closely with Hamel Husain from the GitHub team to create an action that allows you to run data validation with Great Expectations from your GitHub repository when you create or update a PR (or based on other GitHub events). You can find detailed step-by-step instructions in the documentation for this action, but here’s a quick peek at what your workflow will look like:
- Make sure your data pipeline or model code is in a GitHub repo.
- Set up a deployment of Great Expectations, connect to your data (files, SQLAlchemy sources, Spark dataframes…), and create Expectations to assert what you expect your data to look like. The data could either be real data in a dev/testing environment, or static data fixtures.
- Configure your GitHub repository to use the GE action, and connect it to your datasource by adding credentials to GitHub Secrets, if needed.
- Modify your data pipelines and re-run them in a dev or test environment.
- Push the modified code and create a PR.
- This will then trigger the GitHub action to run data validation with Great Expectations on the dev/test data environment and publish the validation result to your PR as a comment. You can also configure Data Docs to be served on a platform such as Netlify.
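To make the last step more concrete, here is a minimal sketch of what "validate, then report on the PR" boils down to. It is plain Python and deliberately does not use the Great Expectations API or the action's real configuration: the `Expectation` class, the `validate`, `format_pr_comment`, and `exit_code` helpers, and the sample rows are all hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Expectation:
    """Hypothetical stand-in for a data expectation: a name plus a check."""
    name: str
    check: Callable[[List[Row]], bool]


@dataclass
class Result:
    name: str
    success: bool


def validate(rows: List[Row], expectations: List[Expectation]) -> List[Result]:
    # Run every expectation against the data and record pass/fail.
    return [Result(e.name, bool(e.check(rows))) for e in expectations]


def format_pr_comment(results: List[Result]) -> str:
    # Build the kind of summary a CI job could post back to a pull request.
    passed = sum(r.success for r in results)
    lines = [f"Data validation: {passed}/{len(results)} expectations passed", ""]
    for r in results:
        mark = "PASS" if r.success else "FAIL"
        lines.append(f"- [{mark}] {r.name}")
    return "\n".join(lines)


def exit_code(results: List[Result]) -> int:
    # A non-zero exit code is what makes a CI check show up as failed.
    return 0 if all(r.success for r in results) else 1


# Made-up data from a dev/test environment.
rows = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 51},
    {"user_id": 3, "age": None},
]

expectations = [
    Expectation(
        "user_id is unique",
        lambda rs: len({r["user_id"] for r in rs}) == len(rs),
    ),
    Expectation(
        "age is never null",
        lambda rs: all(r["age"] is not None for r in rs),
    ),
]

results = validate(rows, expectations)
comment = format_pr_comment(results)
print(comment)
```

In a real setup the action takes care of running the validation and posting the comment for you; the sketch only shows the shape of the idea, namely that a failed expectation turns into a visible, failing check on the PR.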
We can think of several different applications for this action. For example, in an ETL pipeline, this could be as simple as making sure that changes to the pipeline don’t introduce any data quality issues into the downstream data. In order to isolate issues caused by pipeline changes vs those caused by data changes, we recommend running these tests on static test data. In an ML context, you can also test that the output of your model meets certain expectations after making modifications to the model.
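Here is a small sketch of why static test data helps isolate pipeline changes. Again, this is plain Python rather than Great Expectations code, and everything in it is made up for illustration: the fixture, the two versions of the transform, and the `check_output` helper. Because the input never changes, any new failure has to come from the code change.

```python
from typing import Dict, List

Row = Dict[str, object]

# A static fixture: the same input on every CI run.
STATIC_FIXTURE: List[Row] = [
    {"order_id": "A1", "amount_cents": 1250},
    {"order_id": "A2", "amount_cents": 0},
    {"order_id": "A3", "amount_cents": 99900},
]


def transform_v1(rows: List[Row]) -> List[Row]:
    # Hypothetical pipeline step currently in production.
    return [
        {"order_id": r["order_id"], "amount_usd": r["amount_cents"] / 100}
        for r in rows
    ]


def transform_v2(rows: List[Row]) -> List[Row]:
    # Hypothetical change in a PR that accidentally drops zero-amount orders.
    return [
        {"order_id": r["order_id"], "amount_usd": r["amount_cents"] / 100}
        for r in rows
        if r["amount_cents"]
    ]


def check_output(source: List[Row], output: List[Row]) -> List[str]:
    # Return a list of human-readable failures; an empty list means all good.
    failures = []
    if len(output) != len(source):
        failures.append(
            f"row count changed: expected {len(source)}, got {len(output)}"
        )
    if any(r["amount_usd"] < 0 for r in output):
        failures.append("amount_usd contains negative values")
    return failures


print(check_output(STATIC_FIXTURE, transform_v1(STATIC_FIXTURE)))
print(check_output(STATIC_FIXTURE, transform_v2(STATIC_FIXTURE)))
```

The same pattern carries over to the ML case: keep the evaluation inputs fixed, and assert on the model's outputs after each change.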
Why are we so excited about this collaboration?
To the best of our knowledge, this is one of the first integrations of data testing and documentation in a CI/CD workflow that’s supported by a platform as big as GitHub. We all know we should test our data pipelines, but that testing is often either done manually by the data engineer during development or left to a home-grown data validation system. Neither of these solutions is particularly reliable, scalable, or sustainable in the long term. Just as you would run integration tests on a PR for code, the GE GitHub action runs data tests on your updated data and catches potential issues in the code changes before they get into production. Any engineer or data scientist making changes to the pipeline can run the regular GE tests locally, but the CI tests provide an additional safety net, and you could even run more extensive tests on remote infrastructure.
You’ll find detailed information and instructions about the action in the Great Expectations Action repo, so hop over to check it out and get started. And as always, feel free to join the GE Slack channel if you have any questions or want to contribute to the open source project!
And finally, once more, a big thanks to Hamel from the GitHub team for this amazing collaboration. It’s been an absolute pleasure working with you!