Data contracts are pointless if they aren’t useful. And data people need them to be useful because we have data problems to solve.
Specifically, we need data contracts to accomplish one key task: orienting everyone in the same direction, so the real problem-solving work can be effective. But there’s a huge range of scenarios that fall under data problems in an enormous variety of contexts.
Here’s how you can define data contracts in a way that’s concrete enough to achieve their task and still flexible enough to work everywhere.
Define a data contract
At Great Expectations we’ve been wrestling with finding the right way to define data contracts for a long time. Here’s what we’re currently working with:
A data contract is an inspectable artifact that describes an alignment between data users on what makes the data fit for its intended purpose.
First, you’ll notice that our definition doesn’t include any particular technology, not even our own. We did this on purpose because data contracts have to be durable long-term. They need to be able to incorporate any tool a data team might be using, even tools that haven’t been invented yet.
Sure, we don’t expect, for example, API functionality to go away anytime soon, but you never know. And ‘having data’ is a concept that’s independent of any particular software or hardware, so we believe that ‘data contract’ should be too.
We also made a deliberate decision not to include ‘enforceable’ or ‘that enforces’ or anything along those lines for two main reasons:
The first reason was to maintain the analogy with actual legal contracts. A legal contract doesn’t enforce itself; it’s simply a piece of paper (or, more likely, a collection of electrons). Some kind of state power enforces it. If a contract is broken but all parties involved ignore it… nothing happens.
It didn’t make sense to us to hold data contracts to a different standard than that.
The second reason was to preserve the idea of the verbal data contract. Like its legal counterpart, a verbal data contract is materially less enforceable and, therefore, less desirable than an inspectable data contract. But verbal contracts are absolutely everywhere and serve a purpose, even if in a less-than-ideal way.
Another part of the definition we put a lot of thought into is our choice to describe the parties as ‘data users.’ ‘Users’ easily encompasses people who help originate the data or facilitate its entry into the organization’s control (the effective data producers), active data consumers like data scientists and business analysts, and more passive data consumers like dashboard users.
Finally, to describe what the data contract was actually aligning people on, we chose ‘fit for intended purpose.’
This is one of a few popular phrases that often pop up in definitions of data quality. Another is ‘fit for use,’ which we decided against because ‘purpose’ captures a level of meaning and intent that ‘use’ doesn’t.
We deliberately added the modifier ‘intended’ because one of the ways that data non-alignment can occur is when existing data is co-opted for some new purpose, with the new data users not fully understanding it.
Without a data contract, it’s extremely difficult for new data users to reliably know whether or not they understand what the data is meant to be: even seemingly straightforward data can have hidden nuances. With a data contract, new data users can simply refer to the contract to learn what the data is supposed to be like and determine whether it’s suitable for their purposes as-is.
Another way of describing data alignment is knowing ‘what the data is’ or ‘what the data should be.’ We found these phrases to be too vague: they can encompass everything from semantic meaning to file formats.
By using ‘fit for intended purpose,’ we describe the understanding of the data in a concrete, functional way: the data users are going to do X, and in order to accomplish X, the data has to Y.
Figuring out the Y is one of the complex but essential parts of negotiating a data contract.
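To make the X-and-Y framing concrete, here’s a minimal Python sketch of what one inspectable contract could look like. Everything in it is an assumption for illustration: the `DataContract` and `Requirement` classes, the revenue-reporting purpose, and the sample orders are all hypothetical. Python is only one of many forms the artifact could take; a shared document could qualify under the definition above just as well.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Row = dict[str, Any]


@dataclass
class Requirement:
    """One 'the data has to Y' statement, paired with a check that makes it inspectable."""
    description: str
    check: Callable[[list[Row]], bool]


@dataclass
class DataContract:
    """Hypothetical structure: an intended purpose (X) plus what the data needs to be (Y)."""
    intended_purpose: str
    requirements: list[Requirement] = field(default_factory=list)

    def evaluate(self, rows: list[Row]) -> dict[str, bool]:
        # Report the status of each requirement. The contract describes; it does not enforce.
        return {r.description: r.check(rows) for r in self.requirements}


# X: the purpose the data users agreed on. Y: what the data needs to look like for X.
contract = DataContract(
    intended_purpose="Compute weekly revenue per customer",
    requirements=[
        Requirement(
            "every order has a customer_id",
            lambda rows: all(r.get("customer_id") is not None for r in rows),
        ),
        Requirement(
            "amount is never negative",
            lambda rows: all(r["amount"] >= 0 for r in rows),
        ),
    ],
)

# Made-up sample data: the second order is missing its customer_id.
orders = [
    {"customer_id": "c1", "amount": 20.0},
    {"customer_id": None, "amount": 5.0},
]

report = contract.evaluate(orders)
print(report)
# {'every order has a customer_id': False, 'amount is never negative': True}
```

Note that `evaluate` only reports which requirements hold. Deciding what happens when one fails is left to the people who agreed to the contract, which is in keeping with leaving ‘enforceable’ out of the definition.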
Data contracts are defined: now what?
At this point, you might be saying: OK, how is that definition useful without specifics?
That's the beauty (and the challenge) of data contracts. You take this definition and figure out what its implementation needs to be to work in the specific context of your team, your organization, and your pipelines.
Maybe Chad Sanderson's proposed solution fits you perfectly. Maybe you need a feedback loop between your nontechnical experts and your data team to define requirements, then custom checks that are shareable with the nontechnical team members. Maybe you need a bunch of color-coded post-its on the wall. We don't know your life.
There's not a one-size-fits-all, or even one-size-fits-most, data contract. But if you use our definition, you can figure out the one-size-fits-you data contract, which is the only one you need.
What do you think of our definition? Let us know @expectgreatdata!