
Four kinds of data quality issues to look out for

Successful data quality projects start here

Nick O'Brien
August 24, 2023
A photo of a magnifying glass being held up in front of yellow flowers, in which only the flowers seen through the magnifier are in focus
📸: Jess rodriguez via Adobe Stock

Datasets, to borrow a phrase from Walt Whitman, contain multitudes. When you’re taking on a large data quality project, looking for every potential issue at once is all but impossible. 

Instead, data practitioners should take a more targeted approach that prioritizes the biggest, most common categories of data quality issues. We’ve identified four of them below. These issues can be highly impactful, which is why Great Expectations is working on new Data Assistants that specifically target them. 

As you’ll see, the kinds of issues we’ll cover typically stem from one or more of the following causes: 

  • Human error, or decisions made by individuals without consulting your team

  • Dependence on external systems you can’t control

  • Unintended consequences of an intentional change in how data is exchanged or displayed

  • An exchange of data between two systems that present data according to different norms

Keep reading to dive into each type of issue and its nuances and complexities.

1. Missing data

We’ve all dealt with incomplete datasets—even if we didn’t realize it at first glance. Common issues include: 

  • Empty cells
    A cell might contain no data, or contain a value, such as a question mark, indicating a lack of important information.

  • Missing records
    An entire record might be absent from a time series or batch. But there’s a silver lining: This might at least be more noticeable than a single empty cell.

  • Incomplete information from originating sources
    For this, let’s look at an example that one of our clients experienced. The company pushed a change to its analytics system that caused the receiving end to reject messages from an older version of the software where the data originated. This didn’t stop the older version from sending data. Instead, that data was simply left out of updates to the company’s datasets, rendering them incomplete.
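
The first two cases above can be checked with very little code. The following is a minimal sketch in plain Python, not a Great Expectations API. The function names, the "day" and "rides" columns, and the set of placeholder strings treated as missing are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: these placeholder strings mean "no data" in this dataset.
MISSING_MARKERS = {"", "?", "n/a", "null"}

def is_missing(value):
    """Return True for None or for a placeholder string that signals no data."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in MISSING_MARKERS

def find_empty_cells(rows):
    """Return (row_index, column) pairs for every missing cell."""
    return [
        (index, column)
        for index, row in enumerate(rows)
        for column, value in row.items()
        if is_missing(value)
    ]

def find_missing_dates(rows, date_column="day"):
    """Return the dates absent from a daily series between its first and last day."""
    seen = {row[date_column] for row in rows}
    if not seen:
        return []
    current, last = min(seen), max(seen)
    missing = []
    while current <= last:
        if current not in seen:
            missing.append(current)
        current += timedelta(days=1)
    return missing

# Hypothetical sample: one placeholder, one None, and no record for August 3.
rows = [
    {"day": date(2023, 8, 1), "rides": 120},
    {"day": date(2023, 8, 2), "rides": "?"},
    {"day": date(2023, 8, 4), "rides": None},
]
empty_cells = find_empty_cells(rows)
missing_dates = find_missing_dates(rows)
```

The third case, data silently dropped at the source, generally can't be detected from the dataset alone. It usually requires comparing what was received against what the originating system reports having sent.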

2. Data freshness

Another common type of issue is stale data—that is, datasets that don’t reflect recent changes. This is often the result of one or more originating sources becoming unavailable, perhaps due to a software update or system malfunction, which prevents updated data from reaching your receiving system. 

We’ve seen freshness issues take the following forms:

  • Data arrives as expected but isn’t fresh
    We once worked with a company that used an API to pull stock market data from an external system on a regular basis. One day, that external system experienced an outage. The company soon discovered that, unable to get updated data while the system was down, the API had instead been sending old data loads that predated the outage.

  • Data doesn’t arrive, rendering the existing data stale
    Sometimes an issue at the originating source will entirely halt your inflow of new data. Without new updates, it doesn’t take long for your existing data to go stale.

  • Data arrives, but not at the expected rate
    If you pull data from multiple sources, then the freshness of your data relies on all of those sources being up and running and providing you with timely updates. It only takes one outage or software update to significantly slow your inflow rate.
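
A freshness check along these lines might look like the sketch below. It is plain Python rather than any particular library's API, and it assumes each record carries its own timestamp (a hypothetical "observed_at" field) plus a "source" label. Reading the timestamp inside the data, rather than the time the load arrived, is what would catch a case like the stock market example, where loads kept arriving but contained old data.

```python
from datetime import datetime, timedelta

def is_stale(records, now, max_age, ts_field="observed_at"):
    """True if there are no records or the newest one is older than max_age."""
    if not records:
        return True
    newest = max(record[ts_field] for record in records)
    return now - newest > max_age

def stale_sources(records, now, max_age,
                  source_field="source", ts_field="observed_at"):
    """Return the sources whose newest record is older than max_age."""
    latest = {}
    for record in records:
        source = record[source_field]
        if source not in latest or record[ts_field] > latest[source]:
            latest[source] = record[ts_field]
    return sorted(s for s, ts in latest.items() if now - ts > max_age)

# Hypothetical sample: feed_a is current, feed_b stopped updating hours ago.
now = datetime(2023, 8, 24, 12, 0)
records = [
    {"source": "feed_a", "observed_at": datetime(2023, 8, 24, 11, 45)},
    {"source": "feed_b", "observed_at": datetime(2023, 8, 24, 8, 0)},
]
overall_stale = is_stale(records, now, max_age=timedelta(hours=1))
lagging = stale_sources(records, now, max_age=timedelta(hours=1))
```

In this sample the dataset as a whole looks fresh because feed_a is up to date, while the per-source check shows that feed_b has fallen behind. That is why it can be worth checking freshness per source when you pull from several.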

3. Volume issues

Many common data issues can cause your data volume to swell or shrink unexpectedly. When looking for these cases, it’s key to keep an eye on both arriving and existing records. If records fail to arrive or arrive more than once, you’ll see unexpected volume changes. Ditto if you find that existing records have been deleted or duplicated.

Here’s what you might see:

  • Unexpected fluctuations in data volume
    If a report shows that a taxi company, for instance, had four rides one day and 731 the next, something is probably off. The opposite scenario—a report showing no fluctuation at all—may also indicate a problem, as taxi companies don’t typically have the exact same number of rides each day.

  • Unexpected volume distribution
    If a sales team, for example, runs a report showing that the majority of new deals in the previous month closed on the weekend, they know they have a distribution problem, as sales activity typically takes place between Monday and Friday.
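
Both kinds of volume problem can be approximated with simple checks on daily counts. The sketch below is plain Python with hypothetical function names. The factor of 5 used to flag an unusual day is an arbitrary assumption, and a real threshold would depend on how much your data normally varies.

```python
from datetime import date
from statistics import median

def unusual_days(daily_counts, factor=5):
    """Flag days whose count is more than `factor` times above or below the median."""
    typical = median(daily_counts.values())
    return sorted(
        day for day, count in daily_counts.items()
        if count > typical * factor or count * factor < typical
    )

def is_suspiciously_flat(daily_counts):
    """True if several days all report exactly the same count."""
    return len(daily_counts) > 1 and len(set(daily_counts.values())) == 1

def weekend_share(daily_counts):
    """Fraction of the total volume that falls on Saturday or Sunday."""
    total = sum(daily_counts.values())
    weekend = sum(n for day, n in daily_counts.items() if day.weekday() >= 5)
    return weekend / total if total else 0.0

# Hypothetical taxi data: one day with four rides among days with about 700.
rides = {
    date(2023, 8, 21): 700,
    date(2023, 8, 22): 4,
    date(2023, 8, 23): 731,
    date(2023, 8, 24): 715,
}
flagged = unusual_days(rides)

# Hypothetical sales data: most deals closed on Saturday and Sunday.
deals = {date(2023, 8, 25): 2, date(2023, 8, 26): 5, date(2023, 8, 27): 3}
share = weekend_share(deals)
```

Comparing against the median rather than the mean keeps a single extreme day from distorting the baseline it is measured against.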

4. Schema issues

Finally, be sure to pay attention to how the data itself is structured. Your data quality analysis might reveal:

  • An unexpected number of columns
    You might find that an entire column has disappeared from your dataset or, conversely, discover additional columns you didn’t expect.

    This often stems from disagreements about the parameters of the schema in question. In one case we saw, a team had agreed that a certain column could take one of five values, but another stakeholder needed to add a sixth. Without aligning with the team on how to handle the dispute, that stakeholder decided to house the sixth value in an entirely new column, creating confusion among their colleagues.

  • Data arrangement issues
    Maybe you have all the columns you expect and no more, but they’re in an unexpected order. Maybe the column names are missing or don’t match the data that’s actually there. 
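
A schema check of this kind can be as simple as comparing the column names you expect with the ones you actually received. The following is a minimal sketch in plain Python. The function name and the column names in the example are hypothetical.

```python
def compare_schema(expected, actual):
    """Compare two ordered lists of column names.

    Returns the missing columns, the unexpected columns, and whether the
    columns present in both lists appear in the same relative order.
    """
    expected_set, actual_set = set(expected), set(actual)
    shared_in_expected = [c for c in expected if c in actual_set]
    shared_in_actual = [c for c in actual if c in expected_set]
    return {
        "missing": [c for c in expected if c not in actual_set],
        "unexpected": [c for c in actual if c not in expected_set],
        "shared_order_matches": shared_in_expected == shared_in_actual,
    }

# Hypothetical example: one extra column, and two columns swapped.
expected_columns = ["ride_id", "pickup_time", "fare"]
actual_columns = ["ride_id", "fare", "pickup_time", "status_2"]
report = compare_schema(expected_columns, actual_columns)
```

A name comparison like this won't catch a column whose name is right but whose contents don't match it. That case needs checks on the values themselves, such as expected types or allowed values.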

Tip: Make it iterative

When analyzing these types of data issues, some data practitioners try to catch every potential error in one analysis and then move on. This almost never works. Every time you make a change to your data, you’re looking at a new dataset, and fixing one issue may reveal another you wouldn’t have noticed otherwise. Expecting to catch everything in a single pass is a mistake.

That’s why it’s critical to take an iterative approach. Conducting analyses on a regular basis will continually improve your process—and yield better data quality. It will also create chances to bring in new stakeholders with different perspectives, which is critical for a well-rounded analysis.

Conclusion

We’re excited about the possibilities that Data Assistants offer for helping data practitioners zero in on these four types of quality issues. Check out the experimental Missingness Assistant and stay tuned, or get in touch to learn more.


