As a Reddit user recently recounted, a team of security analysts once published a report naming New Zealand as the world’s second-largest source of cyberattacks. Eventually, the analysts discovered that the report was based on bad data from faulty security sensors in the country, and was therefore completely wrong. The ambassador from New Zealand was ready to cry foul about the mishap at the UN but backed off when the analysts retracted the report.
That’s the thing about data quality: One simple issue could be all it takes to cause an international incident.
Few organizations would tell you data quality doesn’t matter. Externally, clumsy data errors can make you look unprofessional, causing reputational and financial damage. Internally, they can make stakeholders give up on official data sources and start going with their gut instead, leading to uninformed decisions and wasted time. Before you know it, you could be dealing with:
Lower productivity
Every minute spent fixing reports, pipelines, and dashboards is a minute not spent advancing your company’s business objectives.
Lost revenue
Poor data quality often shows up as systems feeding each other bad information. If, for example, a company’s systems aren’t communicating correctly about which service a customer uses, the company could end up underbilling and losing money.
Compliance issues
If stakeholders who have lost trust in official data start creating their own data stores with no oversight, they may run afoul of regulatory requirements in areas such as handling PII.
If this all seems overblown, it’s not: Gartner reports that organizations lose an average of $12.9 million annually due to poor data quality. But it’s one thing to think of this issue in terms of reports, statistics, and marketing language – it’s another to see what it actually looks like on the backend for the people involved. On that note, let’s venture inside the trenches and see the kind of fallout real engineers deal with when data quality goes awry.
Crowd-sourced data quality horror stories
The New Zealand anecdote comes from a response to a question a Great Expectations team member posed on Reddit: “What’s your favorite data quality horror story?” Here are a few others:
A sports brand noticed a precipitous year-over-year drop in water bottle sales. It came to light that, twelve months earlier, a single bottle had been erroneously sold for $1 million, inflating the prior year’s figures. The mistake was corrected in-store, but the correction never made it to the corporate databases. A full year isn’t so long for that to go unnoticed...right?
A logistics company that charges a 1% fee on COD amounts accidentally billed a client hundreds of millions on a single transaction. The client’s customer had inadvertently entered their 11-digit account number instead of the price of the product.
A company relied on an algorithm to update their prices in near-real time. What they didn’t often update was their technology. One day, the algorithm misfired so badly that each of the more than 100,000 products the company offered was priced – and listed publicly – at $0.01.
While doing some troubleshooting, an engineer hit a wall when searching for a number in a column that should have contained numbers, ultimately discovering that someone had converted the column to ‘varchar.’ The engineer soon learned that, absent any proper process, their colleagues had been using ‘convert to varchar’ as a cure-all for any and all data problems.
How to avoid telling your own horror stories
That last story points to an issue at the root of innumerable data quality problems: A lack of process. SOPs and best practices are critical for minimizing data quality issues, and solving the issues that do occur means properly investigating root causes. All of this starts with involving the right stakeholders across your organization. Talk to domain experts and those who work hands-on with the data – that includes both those who create data and those who analyze it.
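Process doesn’t have to mean paperwork, either: much of it can be codified as automated checks that run before bad data spreads. As a rough sketch only, here’s what such a check might look like in Python with pandas; the column name, plausibility threshold, and sample data are all made up for illustration, but a check along these lines would flag both a numeric column silently converted to ‘varchar’ and a wildly implausible value like a $1 million water bottle.

```python
# A minimal sketch of an automated column check, assuming the data is loaded
# into a pandas DataFrame. Column names and thresholds are illustrative only.
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_order_totals(df: pd.DataFrame, column: str = "order_total") -> list[str]:
    """Return human-readable problems found in `column`."""
    problems = []

    # Catch silent type changes, e.g. a numeric column quietly converted to varchar/text.
    if not is_numeric_dtype(df[column]):
        problems.append(f"{column} is {df[column].dtype}, expected a numeric type")

    # Catch values that are technically valid but implausible for the business,
    # like a $1 million water bottle or an account number pasted into a price field.
    numeric = pd.to_numeric(df[column], errors="coerce")
    if numeric.isna().any():
        problems.append(f"{column} contains values that cannot be parsed as numbers")
    if (numeric > 10_000).any():  # set the ceiling with domain experts, not in isolation
        problems.append(f"{column} contains values above the agreed plausibility ceiling")

    return problems


if __name__ == "__main__":
    orders = pd.DataFrame({"order_total": ["19.99", "24.99", "1000000.00"]})  # note: strings
    for problem in check_order_totals(orders):
        print("DATA QUALITY:", problem)
```

A check this simple won’t catch everything, but wiring it into a pipeline turns ‘someone eventually notices’ into ‘the pipeline complains on day one.’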
Along the way, keep things fully transparent. If the data engineer in the final horror story above had had colleagues who communicated broadly and consistently, they might well have uncovered and solved the issue much faster. Make sure everyone is on the same page – not just about what’s wrong with the data, but about how ‘right’ is defined.
For example, an analytics campaign about advertising spend may only require regional data, while data pertaining to a physical mail campaign would need to include specific addresses. If there’s not enough alignment on the level of specificity required in a case like this, engineers could end up getting conflicting information from various stakeholders about what and how much work they need to do to address an issue.
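One way to build that alignment is to write the definition of ‘right’ down as explicit, per-use-case requirements rather than leaving it in people’s heads. The sketch below, again in Python with pandas, shows the idea; the use cases, column names, and sample rows are invented for the example.

```python
# A minimal sketch of making the definition of 'right' explicit per use case,
# assuming a pandas DataFrame of campaign contacts. All names are illustrative.
import pandas as pd

# Each consumer of the data declares which fields it actually needs complete.
REQUIRED_FIELDS = {
    "ad_spend_analytics": ["region"],                        # regional rollups are enough
    "direct_mail_campaign": ["street", "city", "postcode"],  # needs deliverable addresses
}


def rows_failing_requirements(df: pd.DataFrame, use_case: str) -> pd.DataFrame:
    """Return the rows that don't meet the agreed definition of 'right' for a use case."""
    required = REQUIRED_FIELDS[use_case]
    incomplete = df[required].isna().any(axis=1)
    return df[incomplete]


if __name__ == "__main__":
    contacts = pd.DataFrame({
        "region": ["EMEA", "APAC", None],
        "street": ["1 Main St", None, "9 High St"],
        "city": ["Leeds", "Osaka", "Wellington"],
        "postcode": ["LS1 1AA", None, "6011"],
    })
    # The same table can be good enough for one campaign and unusable for another.
    print(rows_failing_requirements(contacts, "ad_spend_analytics"))
    print(rows_failing_requirements(contacts, "direct_mail_campaign"))
```

When the requirements live in one shared place like this, ‘is this data good enough?’ becomes a question with a single, checkable answer instead of a different answer per stakeholder.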
Conclusion
It’s much more fun to read data quality horror stories than to experience one yourself. By keeping your processes transparent and centralized and your lines of communication open, you can enjoy tales from the trenches without worrying about what horror might be brewing in your own data.