Datasets, to borrow a phrase from Walt Whitman, contain multitudes. When you’re taking on a large data quality project, looking for every potential issue at once is all but impossible.
Instead, data practitioners should take a more targeted approach that prioritizes the biggest, most common categories of data quality issues. We’ve identified four of them below. These issues can be highly impactful, which is why Great Expectations is working on new Data Assistants that specifically target them.
As you’ll see, the kinds of issues we’ll cover typically stem from one or more of the following causes:
Human error, or decisions made by individuals without consulting your team
Dependence on external systems you can’t control
Unintended consequences of an intentional change in how data is exchanged or displayed
An exchange of data between two systems that present data according to different norms
Keep reading to dive into each type of issue and its nuances and complexities.
1. Missing data
We’ve all dealt with incomplete datasets—even if we didn’t realize it at first glance. Common issues include:
A cell might contain no data at all, or it might contain a placeholder value, such as a question mark, that signals missing information.
An entire record might be absent from a time series or batch. But there’s a silver lining: This might at least be more noticeable than a single empty cell.
Incomplete information from originating sources
Consider an example one of our clients experienced. The company pushed a change to its analytics system that caused the receiving end to reject messages from an older version of the software where the data originated. This didn’t stop the older version from sending data; instead, that data was simply left out of updates to the company’s datasets, rendering them incomplete.
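As a sketch of what a basic missingness check might look like in plain Python, the snippet below scans rows for cells that are empty or hold a placeholder marker like a question mark. The column names, sample rows, and placeholder set are illustrative assumptions, not part of any client's setup:

```python
# Illustrative placeholder markers; real pipelines should agree on their own set.
PLACEHOLDERS = {"?", "", "N/A", None}

# Hypothetical sample records for demonstration only.
rows = [
    {"ride_id": 1, "fare": 12.5, "pickup_zone": "Midtown"},
    {"ride_id": 2, "fare": None, "pickup_zone": "?"},    # two suspect cells
    {"ride_id": 3, "fare": 8.0,  "pickup_zone": "Harlem"},
]

def find_missing(rows, placeholders=PLACEHOLDERS):
    """Return (row_index, column) pairs whose value looks missing."""
    issues = []
    for i, row in enumerate(rows):
        for col, value in row.items():
            if value in placeholders:
                issues.append((i, col))
    return issues

issues = find_missing(rows)
```

A check like this catches both truly empty cells and "present but meaningless" values, which is why it's worth listing placeholder markers explicitly rather than only testing for nulls.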
2. Data freshness
Another common type of issue is stale data—that is, datasets that don’t reflect recent changes. This is often the result of one or more originating sources becoming unavailable, perhaps due to a software update or system malfunction, which prevents updated data from reaching your receiving system.
We’ve seen freshness issues take the following forms:
Data arrives as expected but isn’t fresh
We once worked with a company that used an API to pull stock market data from an external system on a regular basis. One day, that external system experienced an outage. The company soon discovered that, unable to get updated data while the system was down, the API had instead been sending old data loads that predated the outage.
Data doesn’t arrive, rendering the existing data stale
Sometimes an issue at the originating source will entirely halt your inflow of new data. Without new updates, it doesn’t take long for your existing data to go stale.
Data arrives, but not at the expected rate
If you pull data from multiple sources, then the freshness of your data relies on all of those sources being up and running and providing you with timely updates. It only takes one outage or software update to significantly slow your inflow rate.
3. Volume issues
Lots of common data issues can cause your data volume to swell or shrink unexpectedly. When looking for these cases, it’s key to keep an eye on both arriving and existing records. If records fail to arrive or arrive more than once, you’ll see unexpected volume changes. Ditto if you find that existing records have been deleted or duplicated.
Here’s what you might see:
Unexpected fluctuations in data volume
If a report shows that a taxi company, for instance, had four rides one day and 731 the next, something is probably off. The opposite scenario—a report showing no fluctuation at all—may also indicate a problem, as taxi companies don’t typically have the exact same number of rides each day.
Unexpected volume distribution
If a sales team, for example, runs a report showing that the majority of new deals in the previous month closed on the weekend, they know they have a distribution problem, as sales activity typically takes place between Monday and Friday.
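A simple way to catch swings like the taxi example is to compare each day's row count against the previous day's. The sketch below flags any day whose count jumps or drops by more than a ratio threshold; the threshold and sample counts are illustrative assumptions:

```python
def volume_alerts(daily_counts, max_ratio=3.0):
    """Return indices of days whose count changes too sharply vs. the prior day."""
    alerts = []
    for i in range(1, len(daily_counts)):
        prev, curr = daily_counts[i - 1], daily_counts[i]
        if prev == 0 or curr == 0:
            # A zero count is always suspicious for a business that runs daily.
            alerts.append(i)
        elif curr / prev > max_ratio or prev / curr > max_ratio:
            alerts.append(i)
    return alerts

# Hypothetical daily ride counts: index 2 is the suspicious "4 rides" day.
counts = [700, 731, 4, 715, 705]
alerts = volume_alerts(counts)
```

A ratio test like this also fires on the recovery day, since the jump back up is just as abnormal; a distribution check (e.g., share of records falling on weekends) would follow the same pattern.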
4. Schema issues
Finally, be sure to pay attention to how the data itself is structured. Your data quality analysis might reveal:
An unexpected number of columns
You might find that an entire column has disappeared from your dataset or, conversely, discover additional columns you didn’t expect.
This often stems from disagreements about the parameters of the schema in question. In one case we saw, a team had agreed that a certain column could take only a fixed set of values, but another stakeholder needed to add a sixth value. Without aligning with the team on how to handle the dispute, that stakeholder decided to house the sixth value in an entirely new column, creating confusion among their colleagues.
Data arrangement issues
Maybe you have all the columns you expect and no more, but they’re in an unexpected order. Maybe the column names are missing or don’t match the data that’s actually there.
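Both kinds of schema drift, extra or missing columns and columns in the wrong order, can be caught by comparing a batch's column list against the agreed schema. A minimal sketch, where the column names are illustrative assumptions:

```python
# Hypothetical agreed-upon schema for demonstration.
EXPECTED_COLUMNS = ["deal_id", "amount", "closed_at"]

def schema_issues(columns, expected=EXPECTED_COLUMNS):
    """Report columns that are missing, unexpected, or out of order."""
    issues = []
    missing = [c for c in expected if c not in columns]
    extra = [c for c in columns if c not in expected]
    if missing:
        issues.append(("missing", missing))
    if extra:
        issues.append(("unexpected", extra))
    if not missing and not extra and columns != expected:
        issues.append(("order", columns))
    return issues

# An unexpected extra column, like the "sixth value" case above:
issues_extra = schema_issues(["deal_id", "amount", "closed_at", "amount_v2"])
# Same columns, wrong order:
issues_order = schema_issues(["amount", "deal_id", "closed_at"])
```

Separating the three failure modes matters in practice: a missing column usually means lost data, while a reordered one often means a positional load will silently put values in the wrong place.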
Tip: Make it iterative
When analyzing these types of data issues, some data practitioners try to catch every potential error in one analysis and then move on. This almost never works. Every time you change your data, you’re looking at a new dataset; fixing one issue may surface another you wouldn’t have noticed otherwise. Thinking you can do everything at once is a fallacy.
That’s why it’s critical to take an iterative approach. Conducting analyses on a regular basis will continually improve your process—and yield better data quality. It will also create chances to bring in new stakeholders with different perspectives, which is critical for a well-rounded analysis.
We’re excited about the possibilities that Data Assistants offer for helping data practitioners zero in on these four types of quality issues. Check out the experimental Missingness Assistant and stay tuned, or get in touch to learn more.