
Exploring data quality: schema validation

The unsung hero of regulatory reporting, data quality has many dimensions to consider

GX team
September 13, 2024

This post is the first in a seven-part series that uses a regulatory lens to explore how different dimensions of data quality can impact organizations, and how GX Cloud can help mitigate those impacts.

In this post, we’ll begin to explore how FinTrust Bank, a fictional financial institution, tackles its data quality challenges. And we’ll see how Great Expectations, via its GX data quality platform, mitigates those challenges and ensures the integrity of regulatory reporting.

Like most modern organizations, FinTrust Bank is inundated with huge amounts of data that are critical to its decision-making. And as a financial institution, it operates in one of the industries that need to be particularly concerned with regulatory compliance.

Inaccurate or incomplete data can lead to flawed financial statements, misguided business decisions, and severe regulatory penalties. It erodes trust with stakeholders and damages the institution’s reputation.

Data teams at financial institutions don’t start out in an ideal place. Vital data often arrives in a less-than-refined state, riddled with inconsistencies and errors. Data teams have the challenge of building a data quality process that can transform this chaos into accurate and reliable data—with extra-high stakes from the regulatory implications.

So at organizations like FinTrust Bank and its real-life counterparts, ensuring data integrity isn’t just a task: it’s a strategic imperative.

Leading the charge at FinTrust Bank is Samantha, a data engineer with over a decade of experience in handling financial data. Samantha stands in for the data teams at financial institutions who are committed to upholding high standards of data quality, despite the obstacles they face.

A data audit revelation

Samantha and her team embarked on a comprehensive data audit to assess the current state of their data assets. And their findings were alarming:

1. Missing fields: Key financial data—such as transaction amounts, account numbers, and timestamps—were often missing from the datasets. This omission could lead to incomplete regulatory reports, which might result in substantial fines.

2. Incorrect column types: Several columns contained data of unexpected types. In one case, a 'transaction_amount' column, which should contain numerical values, included string entries. This mismatch could lead to calculation errors and inaccurate financial reports.

3. Logical inconsistencies: Certain numerical fields, such as transaction amounts, contained values that defied logic, like negative deposits and positive withdrawals, neither of which should exist. These anomalies could severely distort financial metrics and forecasts. (The snippet after this list illustrates what such records can look like.)
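
To make these findings concrete, here is a minimal, hypothetical sketch of what such problem records might look like in a pandas DataFrame. The column names and values are illustrative, not FinTrust’s actual data.

```python
import pandas as pd

# Hypothetical transaction records illustrating all three audit findings:
# a missing account number and timestamp, a string in a numeric column,
# and a negative deposit amount.
transactions = pd.DataFrame({
    "account_number": ["ACC-1001", None, "ACC-1003"],
    "transaction_type": ["deposit", "withdrawal", "deposit"],
    "transaction_amount": [250.00, "125.50", -75.00],
    "timestamp": ["2024-09-01T10:15:00", "2024-09-01T11:02:00", None],
})

# The stray string forces pandas to store transaction_amount as generic
# objects rather than floats: exactly the kind of schema drift the audit surfaced.
print(transactions.dtypes)
```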

FinTrust’s quarterly compliance report is a cornerstone of its regulatory submissions. After auditing it, Samantha’s team found that 15% of the required fields were either missing or contained erroneous data. These discrepancies put FinTrust Bank in direct danger of substantial penalties from regulatory bodies, not to mention reputational damage.

Given the audit findings, Samantha knew the team had to act fast. FinTrust needed a robust and systematic way to improve the integrity of its data and ensure that every piece of information was accurate, complete, and compliant.

Schema validation to the rescue

Following the alarming results of their audit, Samantha and her team decided to tackle schema inconsistencies first. Their reasoning: without a solid foundation for the data structure, solutions for more complex data quality issues wouldn’t hold up. A robust schema validation process was now essential for FinTrust. 

Schema validation defines a clear blueprint for a dataset’s structure and ensures that incoming data conforms to it. In the financial industry, with its sensitive data and stringent regulatory requirements, schema validation provides crucial risk mitigation, supports compliance, and streamlines operations.

What is a data schema?

The schema is the structural blueprint of a dataset. It defines the overall organization of information, including elements like column names and data types. Ensuring that data adheres to its expected schema is fundamental to maintaining its quality and reliability.
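
As a concrete (and hypothetical) illustration, the expected schema for a transactions table might be written down as a simple mapping of column names to data types; the actual schema would come from FinTrust’s reporting requirements.

```python
# A hypothetical expected schema for a transactions table:
# each column name mapped to the data type it should contain.
EXPECTED_TRANSACTIONS_SCHEMA = {
    "account_number": "string",
    "transaction_type": "string",
    "transaction_amount": "float",
    "timestamp": "datetime",
}
```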

The GX platform equipped Samantha’s team with the schema validation tools they needed. With GX, the data quality team could easily define their expected data schema and automatically validate whether their data met these standards.

Key schema Expectations in GX

Let's look at the key schema Expectations that Samantha's team leveraged in GX:

  • Expect column to exist: Ensures that all essential columns are present and accounted for with the expected name, detecting missing data fields or renamed columns.

  • Expect table column count to equal: Verifies that only the expected number of columns are present, a safeguard against unexpected data piggybacking on legitimate data.

  • Expect column values to be of type: Confirms that each column contains the correct type of data—dates, numbers, text, etc. This drastically reduces the need for manual oversight of the data.

  • Expect column values to be in type list: Ensures that each value in a specified column matches one of a list of permitted types. This is especially useful for columns with varied permissible types, such as mixed-type fields often found in legacy databases.

These Expectations can be applied directly from the GX Cloud UI or using the GX Core Python framework, streamlining the validation process.
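
For the Python route, a minimal sketch of applying these four Expectations might look like the following. This assumes GX Core 1.x and an in-memory pandas DataFrame; the data source, asset, and column names are illustrative rather than FinTrust’s actual setup.

```python
import great_expectations as gx
import pandas as pd

# Illustrative transactions data; in practice this would come from the bank's warehouse.
df = pd.DataFrame({
    "account_number": ["ACC-1001", "ACC-1002"],
    "transaction_amount": [250.00, -75.25],
    "timestamp": pd.to_datetime(["2024-09-01", "2024-09-02"]),
})

context = gx.get_context()

# Register the DataFrame as a data asset and get a batch to validate.
data_source = context.data_sources.add_pandas("transactions_source")
data_asset = data_source.add_dataframe_asset(name="transactions")
batch_definition = data_asset.add_batch_definition_whole_dataframe("all_rows")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

# The four schema Expectations described above.
schema_expectations = [
    gx.expectations.ExpectColumnToExist(column="account_number"),
    gx.expectations.ExpectTableColumnCountToEqual(value=3),
    gx.expectations.ExpectColumnValuesToBeOfType(
        column="transaction_amount", type_="float64"
    ),
    gx.expectations.ExpectColumnValuesToBeInTypeList(
        column="account_number", type_list=["str", "object"]
    ),
]

for expectation in schema_expectations:
    result = batch.validate(expectation)
    print(f"{type(expectation).__name__}: success={result.success}")
```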

Screenshot: adding schema Expectations in the GX Cloud UI.

Expect the unexpected (and prevent it)

Using GX for schema validation, Samantha and her team were able to transition from firefighting schema-based data quality issues to proactive assurance work.

Implementing these Expectations had an immediate impact: FinTrust Bank’s reporting error rates dropped from 5% to 0.25%, significantly reducing its risk of regulatory penalties.

Beyond schema validation

While establishing your initial schema validation is crucial, it’s not the end of your schema validation work. Your schema will continue to evolve, so you need to have a way to manage the corresponding changes and keep your validation up to date. By implementing structured processes for schema versioning and maintaining comprehensive validation practices, you can anticipate and mitigate potential disruptions.
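
One practical pattern is to keep schema Expectations defined in code, grouped into a named Expectation Suite, so that changes to the expected schema are reviewed and versioned alongside the schema itself. Here is a minimal sketch, again assuming GX Core 1.x and hypothetical names.

```python
import great_expectations as gx

context = gx.get_context()

# A named suite for the current schema version; storing this definition in
# version control lets schema changes be reviewed like any other code change.
suite = context.suites.add(gx.ExpectationSuite(name="transactions_schema_v2"))
suite.add_expectation(gx.expectations.ExpectColumnToExist(column="account_number"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeOfType(
        column="transaction_amount", type_="float64"
    )
)
```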

Join the discussion

How has schema validation transformed your data processes? Join the discussion in our community forum to exchange ideas and experiences with other data practitioners. Together, we can improve best practices for data quality in every industry.

Stay tuned for Part 2 of our series, where we explore data missingness and related strategies for improving your data quality.

