
If you work in any data-related field, data profiling is likely a central component of your professional life. As we establish in our full exploration of data profiling, it’s almost impossible to complete a successful data project without the information you gather during this essential stage of the data quality process.
But if your company is like most, your data is always changing. And as it moves around, grows in volume, and evolves, keeping an up-to-date profile can be a lot of work.
To ease this burden, data teams today are increasingly automating their data profiling. Here’s why automated data profiling matters, and how to take a smart approach if and when you pursue automation.
Automated data profiling: What’s at stake?
Data can change quickly. The importance of automated data profiling comes down to one question: how much time elapses between a change to your data and the creation of a profile that reflects it?
If there’s too long a lag, and your data is changing rapidly, then by the time you create your profile, it’s already outdated—your data has changed again. And if your data pipeline itself is automated, as it is for many data-intensive companies, your data will change even faster, making the problem even more acute.
The more your profiling relies on manual processes, the more likely you are to confront this issue. Automating your data profiling eliminates the lag, so you can rest assured that your profile is up to date and trustworthy, no matter how rapidly your data moves.
Best practices for automated data profiling
Here are some things to consider when your team starts looking to automate.
Decide whether to outsource or build in-house
There are plenty of reasons an organization might want to build its own tool to automate data profiling, and plenty of reasons it might not.
Handling it in-house can provide a level of control and peace of mind that can make a big difference. Maybe your data contains highly sensitive, proprietary, or confidential information, and you’re not quite comfortable giving a third party access to it. Maybe your operations depend heavily on your profile being sound and up-to-date, and even a slight error could have a big enough impact that you’d rather keep the process in your own hands.
On the other hand, there are also plenty of reasons outsourcing your automation could make more sense. Building your own tool, after all, can be expensive and time-consuming. Maybe you don’t have the resources.
Or maybe you only create data profiles on occasion—this may be the case if your pipeline isn’t particularly fast or your data doesn’t change too often—and so it doesn’t seem worthwhile to invest in building your own automation tool even if you do have the means.
It’s up to each organization to figure out, based on the content and velocity of their data, their available resources, and their priorities, whether to use a third-party tool to automate their data profiling or to build their own.
Practice smart vendor selection
There are a lot of data profiling automation vendors out there. There are also a lot of ways that data profiling can go wrong. Smart vendor selection means assessing how well a given vendor protects against potential points of failure and facilitates a reliable, timely profile.
When testing a solution in a free trial or sandbox environment, apply it to all three main categories of data profiling—structural, content, and relationship—so you know it can serve your needs in any use case.
Also assess whether the profiles the solution creates are sufficiently informative. Any effective solution will help you answer critical questions about your data, including, but not limited to, the following (illustrated in the sketch after the list):
Is it accurate?
Is it complete?
Does the content match my expectations?
How is it formatted?
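To make those questions concrete, here’s a minimal sketch in pandas of the kinds of per-column checks a profile answers, plus a simple relationship check. The file names, column names, and the profile_table helper are all illustrative, not part of any particular tool; adapt them to your own data.

```python
import pandas as pd

def profile_table(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Build a simple per-column profile covering type, completeness, and content."""
    profile = pd.DataFrame({
        "dtype": df.dtypes.astype(str),           # structural: how is it formatted?
        "non_null_pct": df.notna().mean() * 100,  # content: is it complete?
        "n_unique": df.nunique(),                 # content: does it match expectations?
        "sample_value": df.iloc[0] if len(df) else None,
    })
    print(f"--- profile: {name} ---")
    print(profile)
    return profile

# Hypothetical tables; swap in your own data sources.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

profile_table(orders, "orders")

# Relationship profiling: does every order reference a known customer?
orphaned = ~orders["customer_id"].isin(customers["customer_id"])
print(f"orders with unknown customer_id: {orphaned.sum()}")
```

A one-off script like this is exactly the kind of manual step that automation replaces: whatever tool you evaluate should, in effect, run checks like these for you every time your data changes.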
Don’t forget to evaluate for speed, as well—especially if you've got a highly volatile pipeline. The faster or more frequently your data changes, the faster you’ll need to create your profiles. Otherwise, as we discussed earlier, your profile will be obsolete as soon as it’s created.
Choose between developer and UI-based tools
Even in today’s era of cloud computing, some data quality teams may prefer an open source, developer-oriented solution to a cloud-hosted one. Maybe your infrastructure isn’t conducive to cloud-hosted SaaS or has esoteric requirements. Maybe you’re set up for bring-your-own-database workflows. Maybe it’s a simple matter of preference: your data team may be highly technical and more partial to programmatic tools. Here’s an overview of Great Expectations OSS, our open source solution for data quality.
On the other hand, if your infrastructure is cloud-ready, your data quality workflows involve a lot of remote collaboration, and you’re comfortable provisioning your team with appropriate data access permissions to maintain data security in the cloud, then a cloud-native solution may be for you. Learn about GX’s SaaS solution, GX Cloud, here.
Ask yourself: How much automation do I actually need?
Before you pursue any approach to automating your data profiling, it helps to consider how much automation your data project truly requires. If your pipeline isn’t automated—or your data isn’t particularly volatile or fast-moving—then your profiling might not need much automation either.
When running Checkpoints using GX, for example, you may require a full suite of automation, with alerts sent out anytime the system detects an irregularity in your data behavior. Or it may suffice to automate the Checkpoint run itself, but review the results manually instead of sending out alerts. The key is calibrating your level of automation to the speed at which your data moves and changes.
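As an illustration, here’s a minimal sketch of those two patterns. It assumes an existing GX OSS project with a Checkpoint named "profile_checkpoint" (a hypothetical name) and uses the pre-1.0 run_checkpoint entry point; newer GX versions restructure this API, and the send_alert helper is a placeholder for whatever notification hook your team uses.

```python
import great_expectations as gx

def send_alert(message: str) -> None:
    # Placeholder: wire this to Slack, email, PagerDuty, etc.
    print(f"ALERT: {message}")

context = gx.get_context()

# Automate the Checkpoint run itself; "profile_checkpoint" is hypothetical.
result = context.run_checkpoint(checkpoint_name="profile_checkpoint")

# Full automation: alert on any detected irregularity.
# For lighter-touch automation, skip the alert and review results manually.
if not result.success:
    send_alert("Checkpoint detected a data irregularity")
else:
    print("Checkpoint passed; no action needed.")
```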
Conclusion
Just as data profiling itself is a multifaceted project, automating the process comes with plenty of its own considerations. Use these tips to find an approach that works for you and your data team.