At Great Expectations, we’ve been watching the evolving public conversation on data contracts with great interest. We’re convinced that something noteworthy is happening, and that the notion of data contracts is going to be an important part of future data stacks.
Naturally, we’re also curious about how data contracts will interact with testing, data quality, and Great Expectations specifically. It’s been the subject of a lot of discussion, both within the GX team internally and with others in the data ecosystem.
In this post I'm taking the long view, starting from the question of whether these newfangled data contracts are really different from what’s already happening on data teams.
Data contracts are about alignment
There are already a bunch of different definitions for data contracts floating around. Of those, Chad Sanderson’s definition is arguably the most prominent:
Data Contracts are API-like agreements between Software Engineers who own services and Data Consumers that understand how the business works in order to generate well-modeled, high-quality, trusted, real-time data.
This is specific and provocative.[1]
If you take Chad’s definition and mash it up with others, you’ll wind up with something like this working definition:
A data contract is a means of achieving alignment between data producers and data users on what makes the data fit for its intended purpose.
The working definition deliberately drops the specifics, because many parts of Chad’s definition (“API-like,” “between Software Engineers… and Data Consumers,” “understand how the business works,” “well-modeled,” “real-time”) are controversial.
However, the need for cross-team alignment about data does not seem to be controversial at all.
3 phases of alignment
How do data teams get aligned about data today? One snarky answer is that they don’t, which causes all manner of confusion and frustration.
Still, for all the leaky pipes and breakages, I think it’s fair to say that most data teams manage to accomplish stuff today, so they must have some means for achieving some level of alignment.
After a lot of thinking and a lot of talking, I submit that you can bucket those means into three phases of data contracts: verbal, written, and automated.
I think of these as similar to the three everyday phases of matter.[2]
Verbal contracts are like gases: with no fixed shape or volume, they naturally expand to fill empty space (the need for alignment in an organization).
Written contracts are like solids: they're dense and inflexible, but probably better than a gas if you want a foundation that you can build on.
Automated contracts are a bit of an unknown, since we’re just getting started with them, but I hope that they can function like liquids in a hydraulic system: fluid, but also load-bearing, able to do work that could never be done by gases or solids.
Of course, there’s some overlap across phases and plenty of variation within phases, depending on how a given organization implements data contracts. But overall, certain inflection points mark phase transitions, where a means of alignment changes and becomes useful in materially different ways.
Let’s look at each of the three phases.
Verbal phase
The first phase of data contracts is what I’m going to call verbal data contracts. They’re extremely informal, but—lacking better alternatives—what many companies use anyway.
Just to say it: data contracts don’t have to be intentional. Plenty of companies have implemented data contracts without ever using that phrase. But to talk about alignment, everyone needs to agree that alignment is even happening… which is where having a term like ‘data contracts’ is helpful.
Verbal data contracts get formed all the time in casual conversation:
Data consumer: “X happened.”
Data engineer: “Oh, that’s because of Y.”
Data consumer: “Can you fix Y so X doesn’t happen?”
Data engineer: “Sure.”
Boom. You’ve achieved alignment, moved things forward, and agreed upon an action. You have a data contract.
Of course, this contract is completely ephemeral. Neither of you can go back and verify any part of the contract, and new people can’t be brought in without additional discussion. And if there’s any dispute about what the data contract contained, you’re back to square one.
It’s also easy for verbal contracts to be imprecise. When a business analyst says to “take the average,” is that the mean, median, or mode? Over what time frame? Including what cases? It’s rarely worth the effort to spell these things out in so much detail when you’re just talking, and you’re liable to lose track of those details later on, anyway.
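To make the ambiguity concrete, here is a small sketch using Python's standard `statistics` module. The order counts are invented for illustration; the only point is that three reasonable readings of “the average” produce three different numbers from the same data.

```python
import statistics

# Hypothetical daily order counts, invented purely for illustration.
orders = [1, 1, 2, 4, 12]

# Three defensible interpretations of "take the average".
averages = {
    "mean": statistics.mean(orders),
    "median": statistics.median(orders),
    "mode": statistics.mode(orders),
}

for name, value in averages.items():
    print(f"{name}: {value}")
```

A verbal contract rarely records which of these was meant, let alone the time frame or the cases included.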
Verbal contracts are the fast, lightweight way that almost every data team starts out. But as the cumulative weight of tacit knowledge builds up over time, verbal data contracts can degrade into flailing around in the dark: no one has enough context to take any purposeful actions. If the problem gets resolved, hopefully someone knows what did it—and if no one does, everyone crosses their fingers and hopes it doesn’t happen again.
So the first big inflection point comes when somebody decides to write the contract down.
Written phase
Data contracts in the written phase have artifacts: data dictionaries, documentation, wikis, descriptions in data catalogs, comments in SQL files—you can think of all of these as written data contracts.
In large organizations, you often see these contracts being closely reviewed and enforced by governance/compliance teams. In some cases, they’re even backed by legal contracts (e.g. confidentiality guarantees under data sharing agreements) or regulation (e.g. data lock for pharmaceutical studies).
Written data contracts are a huge jump up from verbal on many fronts. They’re more inspectable and allow greater precision. With greater precision comes a higher degree of objective verifiability: when a data dictionary says that columns X, Y, and Z are all strings with no more than 120 characters, you can run a SQL query to determine whether or not that’s true.
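As a rough sketch of what that verification might look like, the snippet below uses Python's built-in `sqlite3` module against an in-memory table. The table name `records`, the column names `x`, `y`, and `z`, and the sample rows are all assumptions made up for this example; a real check would run against your own warehouse and SQL dialect.

```python
import sqlite3

MAX_LEN = 120              # limit stated in the (hypothetical) data dictionary
COLUMNS = ["x", "y", "z"]  # stand-ins for columns X, Y, and Z

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (x TEXT, y TEXT, z TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("ok", "fine", "short"),
        ("ok", "a" * 121, "short"),  # y is one character over the limit
        ("ok", "fine", b"\x01"),     # z holds raw bytes, not a string
    ],
)


def count_violations(conn, table, column, max_len):
    """Count rows where the column is not text or exceeds max_len characters.

    NULLs count as violations here, which is one possible reading of
    "all strings". Table and column names are interpolated into the query,
    so they must come from trusted code, never from user input.
    """
    query = (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE typeof({column}) != 'text' OR length({column}) > ?"
    )
    return conn.execute(query, (max_len,)).fetchone()[0]


violations = {
    col: count_violations(conn, "records", col, MAX_LEN) for col in COLUMNS
}
print(violations)
```

Whether a NULL should count as a violation is exactly the kind of detail a written contract can pin down and a verbal one usually leaves open.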
This combination of properties makes written data contracts much more powerful tools for bringing in everyone who uses the data, which is good because widespread buy-in is essential to success. Written data contracts are also a much better way to ensure continuity across time.
Note that there’s no guarantee that an organization will choose to invest the effort to make sure that all of its written data contracts are enforced—the outdated data wiki is a well-known trope among data teams everywhere. But by writing it down, at least you have the comfort of knowing that it could be backed up by actual validation.
But it’s pretty easy to get bogged down at this point.
Written contracts create a tension between flexibility and compliance. Broadly speaking, most organizations end up with either a “data wiki that’s chronically out of date” or “data pipelines that can’t be changed without lengthy review involving the compliance/privacy/governance/legal team.”
Neither is ideal; both create their own friction points, which is why some teams—especially smaller teams—opt to avoid written contracts entirely, and just stick with verbal ones.
The second big inflection point is when a written contract becomes automated.
Automated phase
In the automated phase, a data contract integrates not just into an organization’s teams and workflows, but directly into the company’s data stack. This creates new opportunities to play around with workflows.
For example: automated data contracts can be self-enforcing. Both verbal and written data contracts are manually enforced: a person must notice an issue in the data and realize that it’s in conflict with a data contract.
Integrating automated contracts directly into CI/CD and data monitoring workflows opens up new possibilities: rather than requiring manual review, they can be tested directly against data, both at compile time (when pipeline code changes) and at batch time (when new data arrives).
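Here is a minimal sketch of what a self-enforcing contract could look like in plain Python. It is not the Great Expectations API; the `Rule` class, the `validate` function, and the `order_id` and `currency` columns are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One clause of a hypothetical data contract."""
    column: str
    description: str
    check: Callable[[object], bool]


# An illustrative contract; the columns and rules are invented.
CONTRACT = [
    Rule("order_id", "must be a positive integer",
         lambda v: isinstance(v, int) and v > 0),
    Rule("currency", "must be one of USD, EUR, GBP",
         lambda v: v in {"USD", "EUR", "GBP"}),
]


def validate(rows, contract):
    """Return one human-readable message per failed rule per row."""
    failures = []
    for i, row in enumerate(rows):
        for rule in contract:
            if not rule.check(row.get(rule.column)):
                failures.append(f"row {i}: {rule.column} {rule.description}")
    return failures


batch = [
    {"order_id": 1, "currency": "USD"},
    {"order_id": 2, "currency": "usd"},   # wrong case
    {"order_id": -5, "currency": "EUR"},  # not positive
]

failures = validate(batch, CONTRACT)
for message in failures:
    print(message)

# In a CI job or scheduler, a non-empty list would typically fail the run,
# for example by exiting with a non-zero status.
```

The same `validate` call could run in a CI job against sample data and again in a scheduler against each incoming batch, so nobody has to notice the problem by hand.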
Another example: automated data contracts can translate their own contents for data consumers who speak different technical languages. For example, Great Expectations allows you to compile test suites directly into human-readable documentation, to make sure that tests and docs always stay in sync. Everyone who uses the data—technical and nontechnical users alike—has a version of the data contract that’s in a language they can read.
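The idea of keeping tests and docs in sync can be sketched in a few lines of plain Python. This is an illustration of the general technique, not how Great Expectations implements it; the rule format, the `CHECKS` and `TEMPLATES` tables, and the `email` and `name` columns are assumptions made up for this example.

```python
# A hypothetical contract stored as plain data, so that checks and
# documentation are both derived from the same source.
RULES = [
    {"column": "email", "kind": "not_null"},
    {"column": "name", "kind": "max_length", "value": 120},
]

# How each kind of rule is tested against a value.
CHECKS = {
    "not_null": lambda v, rule: v is not None,
    "max_length": lambda v, rule: v is not None and len(v) <= rule["value"],
}

# How each kind of rule is described to a nontechnical reader.
TEMPLATES = {
    "not_null": "{column} must always be present.",
    "max_length": "{column} must be at most {value} characters long.",
}


def render_docs(rules):
    """Render the rules as a plain-text bullet list."""
    return "\n".join("- " + TEMPLATES[r["kind"]].format(**r) for r in rules)


def passes(row, rules):
    """Return True if the row satisfies every rule."""
    return all(CHECKS[r["kind"]](row.get(r["column"]), r) for r in rules)


print(render_docs(RULES))
print(passes({"email": "ann@example.com", "name": "Ann"}, RULES))
```

Because both functions read from `RULES`, changing the length limit in one place updates the test and the documentation together.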
Note all the ‘can’ language in the previous paragraphs. We’re still in the early days of defining what data contracts can and should do.
This is why the conversation about data contracts is so important. In the GX community, we’ve observed that a large fraction of use cases are focused on the deployment of Expectations at the boundaries between teams, helping teams stay aligned on what makes data fit for its intended purpose.
Figuring out what key jobs automated data contracts need to do—and what standard technologies can make that happen—will be invaluable to their widespread adoption. I see some version of automated data contracts as the best path forward for making data systems more engineer-able, which needs to happen in order to revolutionize the role that data plays in our society.
Comparing and contrasting the 3 phases
Let’s recap. Here’s a compare-and-contrast table for different phases of data contracts. I’ve seen all of these implemented in real organizations, with varying levels of success.
A few key points:
Each means of alignment has its own strengths and weaknesses.
The pros and cons depend strongly on organizational context: who is involved with defining contracts, with what requirements, how a given organization chooses to enforce them, etc.
Until recently, most teams had to choose between verbal and written contracts. The ability to partially automate many of the means of alignment opens up lots of interesting new possibilities and questions.
Conclusion
I love the early direction that the public conversation around data contracts is taking. Lack of alignment around the intended format, contents, and use cases for data is a huge source of confusion and frustration for data teams everywhere, and absolutely a problem worth solving.
By taking the long view in this post, I hope to have helped teams identify where they’re coming from, and surface some of the possibilities for where they could go in the future.
If you’d like to be part of that conversation, I’d love to talk! I’m especially interested in talking with day-to-day data practitioners, to understand what’s most important to you in a data contract, or your experiences in implementing them. You can reach me on Twitter or LinkedIn, or share your thoughts in the GX public Slack channel.
[1] I strongly approve. Chad’s done a great job of putting a specific thesis on the table, and provoking a meaningful public conversation.
[2] The whole gas-solid-liquid thing is just a suggestive metaphor that I wouldn’t interrogate too strongly.