What is data profiling?
Data profiling is one of the first terms you come across when you venture into practically any data-related field: data science, analytics, machine learning. But what exactly does it mean?
In this article, we’ll explore what data profiling is, what it’s used for, and how it relates to other common data practices.
Data profiling definition: What is data profiling?
Data profiling is the act of evaluating your data and its state, and it is essential.
During data profiling, you identify and measure the characteristics of your data that are meaningful to you: the structural, content, and relationship information that tells you what you need to know in order to appropriately use the dataset.
What is data profiling used for?
The better question is: what isn’t data profiling used for?
Data profiling is the cornerstone of just about any business intelligence, data science, or data quality initiative.
It’s how you answer fundamental questions like:
What is this data about?
Is it the data I thought I had?
Does it contain what I expected it to?
Does it contain anything unexpected?
Is there a different amount of it than I expected?
How is it formatted?
Am I parsing it in a way that makes sense?
What’s the quality of the data?
Is it complete?
Does it make sense?
Is it accurate?
The answers to these questions (and others you ask) direct the next steps of your data project.
From the data profile, you’ve identified what you need:
To gather more data?
To investigate the meaning of any of the data?
A different tool for the data?
Data quality repairs on the data?
In essence, data profiling tells you whether you’re ready to move to the next stage of your project. And if you’re not, data profiling tells you what directions you need to go to become ready.
Usually, data profiling is talked about in the context of broad, company-wide business intelligence or data quality projects. But any downstream project can benefit from data profiling.
And in fact, you’re probably already doing data profiling without realizing it: if you glance at your data to see if it ‘looks right,’ that’s the most basic form of data profiling.
Of course, a more thorough data profiling is going to give you better information. So let’s discuss how to do that.
Types of data profiling
How would you answer the question “What is this data like?”
There are a lot of options, but they fall into three main categories: content, structure, and relationships.
Structural data profiling
Structural data profiling describes the meta-arrangement of the data. This includes basic parsing information like the file or database format, delimiters, column and row order, and column names. You can also look at properties like value patterns and statistics when profiling data structure.
At first glance, structural profiling might look simple enough that pointing it out separately seems like a reach. It’s true that for a lot of data sources, structural data profiling is relatively straightforward.
But even in those simple cases, consider a minor change to a data source, like the order of a timestamp and an ID number column being switched. If you don’t know about that, suddenly ID number data is being sent where timestamps should have gone, and vice versa—and your data pipeline is now in chaos.
Note that there are several ways this kind of change could be detected: using column headers, type prediction, column-level statistics, or pattern identification, among other techniques. Statistics and pattern identification, in particular, can act as important data quality guardrails at this point in the data profiling process.
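One of those techniques, pattern identification, can be sketched in a few lines. This is a minimal illustration, not a production profiler: the column names, regex patterns, and 95% threshold are all invented for the example.

```python
import re

# Hypothetical expected value patterns for two columns that might get swapped.
EXPECTED_PATTERNS = {
    "event_timestamp": re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}$"),
    "user_id": re.compile(r"^\d{6,10}$"),
}

def pattern_match_rate(values, pattern):
    """Fraction of non-empty values that match the expected pattern."""
    values = [v for v in values if v]
    if not values:
        return 0.0
    return sum(1 for v in values if pattern.match(v)) / len(values)

def check_structure(columns, threshold=0.95):
    """Flag columns whose values no longer look like the expected pattern."""
    suspect = []
    for col, pattern in EXPECTED_PATTERNS.items():
        rate = pattern_match_rate(columns.get(col, []), pattern)
        if rate < threshold:
            suspect.append((col, rate))
    return suspect

# A feed where the two columns were swapped upstream: ID values landed in
# the timestamp column and vice versa.
feed = {
    "event_timestamp": ["1234567", "7654321"],
    "user_id": ["2024-01-01 12:00:00", "2024-01-02 08:30:00"],
}
print(check_structure(feed))  # both columns flagged with a 0.0 match rate
```

The key design point is that this check works even when column headers are missing or unchanged, because it inspects the values themselves rather than trusting the labels.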
This type of structural issue can be acute for data practitioners working with files generated by external partner organizations. Because the partner works entirely to its own priorities and you have no control over its output, significant file structure changes can arrive regularly and unannounced.
And in many cases, the structural data profile isn’t simple.
For example, a CSV file could have one or more columns where the column’s data is multiple values in JSON format—a scenario common enough that major data warehouses like Snowflake provide specific documentation for it.
Knowing whether any complex columns like this exist, and if so where they are, is critical to effective use of a data source.
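Handling one of these complex columns is straightforward once you know it exists. Here’s a minimal sketch using only the standard library; the feed, the `order_id` column, and the JSON-valued `attributes` column are made up for illustration.

```python
import csv
import io
import json

# A made-up CSV feed where the "attributes" column holds JSON objects.
# In a real CSV file, the embedded quotes are escaped by doubling them.
raw = io.StringIO(
    'order_id,attributes\n'
    '1001,"{""color"": ""red"", ""size"": ""M""}"\n'
    '1002,"{""color"": ""blue"", ""size"": ""L""}"\n'
)

rows = []
for record in csv.DictReader(raw):
    # Parse the embedded JSON so its fields are usable downstream.
    record["attributes"] = json.loads(record["attributes"])
    rows.append(record)

print(rows[0]["attributes"]["color"])  # red
```

If the profiling step hadn’t flagged `attributes` as a JSON-valued column, downstream code would be stuck treating it as an opaque string.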
Content data profiling
Content data profiling is when you examine the semantic values of the data.
This includes evaluating the data’s completeness and types, as well as less tangible qualities like accuracy and validity.
When doing data profiling for completeness, it’s important for all interested parties to be on the same page about how completeness is being defined.
Data can be incomplete at the field, column, row, and source levels. Data that is incomplete because all the timestamps have been replaced with a null is a materially different issue from data that is incomplete because it’s supposed to be for all of Q1 but only February timestamps are present.
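A column-level completeness check is one of the simplest profiling measures to implement. This sketch uses an invented three-row dataset and column names purely for illustration.

```python
def column_completeness(rows, columns):
    """Return the fraction of non-null values per column."""
    totals = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            if row.get(c) is not None:
                totals[c] += 1
    n = len(rows)
    return {c: totals[c] / n for c in columns} if n else {}

rows = [
    {"id": 1, "ts": "2024-02-01", "amount": 10.0},
    {"id": 2, "ts": None, "amount": 12.5},
    {"id": 3, "ts": None, "amount": None},
]

# "id" is fully populated; "ts" is mostly missing, which might indicate
# a field-level problem upstream rather than missing source rows.
print(column_completeness(rows, ["id", "ts", "amount"]))
```

Note that this only catches field-level incompleteness; the “expected Q1 but only got February” case would need a separate check on the range of values that are present.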
Type checking can play an extremely important role in data profiling by identifying when sensitive data like personally identifiable information (PII) is present, so you can ensure that appropriate controls are in place.
In addition to checking that the expected kind of data is present—and that unexpected data isn’t—type detection and profiling can also play a role in structural data profiling, acting as a way to detect structural changes in the absence of meaningful column headers.
When profiling data’s accuracy and validity, the difference between ‘data profiling’ and ‘data quality’ can become extremely blurry. There’s no universally applicable divide between ‘profiling’ and ‘quality’ when it comes to accuracy and validity evaluation, so if you need to make a distinction, do so in whatever way makes the most sense for your workflows.
Relationship data profiling
Relationship data profiling looks at the connections between individual fields and rows in the data. It’s sometimes referred to as relationship discovery.
Within records, relationship data profiling can check things like: does the address data in this record actually form a valid address? Address validation is an extremely common form of relationship data profiling, and is frequently supplemented by external address validation services even if the rest of the data profiling is done in-house.
Relationship data profiling can even extend across datasets; concepts like householding are usually not supported by reliable keys in the raw data, and require the complex understanding of entities as a whole that relationship data profiling can provide.
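A within-record relationship check can be sketched quite simply. Here the question is whether a record’s state and ZIP code fields agree; the two-state prefix table below is a tiny invented sample, not a real validation dataset.

```python
# A tiny invented sample of state -> (low, high) ZIP-prefix ranges.
# A real address check would use a full reference table or an external service.
STATE_ZIP_PREFIXES = {
    "MA": ("010", "027"),
    "NY": ("100", "149"),
}

def zip_matches_state(record):
    """True if the record's ZIP code falls in its state's known prefix range."""
    lo, hi = STATE_ZIP_PREFIXES[record["state"]]
    prefix = record["zip"][:3]
    return lo <= prefix <= hi

good = {"state": "NY", "zip": "10001"}
bad = {"state": "MA", "zip": "90210"}
print(zip_matches_state(good), zip_matches_state(bad))  # True False
```

The point is that neither field is wrong in isolation; only the relationship between them reveals the problem, which is exactly what field-level content profiling would miss.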
Where it fits
Data profiling is a key part of just about any data activity you can name. Unfortunately, naming data activities can be tricky—many terms are used in different ways by different people, organizations, and industries.
In this section, we’ll explore some common terms that are sometimes used synonymously with data profiling, and in what ways they differ.
Data profiling vs data mining
Data profiling and data mining are fairly similar in terms of scope, but their goals are different.
When doing data profiling, you are aiming to understand what the data is—essentially, you’re meeting the data where it’s at. Regardless of your ultimate goals for the data, data profiling is where you evaluate whether the data is suitable to use for achieving those goals.
Data mining is a step or two further down the road: it’s one of the ‘ultimate goals’ that you might have for your data. Specifically, data mining is often predictive work, identifying the patterns and trends from your current data and using those to draw conclusions that you act on.
Data profiling is crucial to data mining because data mining is a classic garbage-in-garbage-out scenario: if you don’t profile your data, you have no idea whether your data mining results are trustworthy.
Data profiling vs data governance
Data profiling and data governance are both ongoing activities, but data governance has a much wider scope than data profiling.
Profiling, by its nature, can only include data that exists and that the data practitioner has access to. Data governance, on the other hand, can extend to a myriad of policies and behaviors surrounding the data, including:
Access monitoring and control
Data retention policies
Data collection policies
Data quality monitoring and remediation
Many data governance activities require data profiling (for example, detecting PII so that appropriate access controls can be implemented), but data profiling is definitively a support to data governance, and not vice versa.
Data profiling vs data discovery
Data discovery, like data profiling, is not well defined.
However, in most definitions data discovery is considered to be more of a data-practitioner-centric action: it’s about identifying what data is available to you as an individual, and then selecting the appropriate data for whatever your project is.
You would use the results of data profiling to make decisions during your data discovery.
Data profiling vs data assessment
Data profiling and data assessment are extremely complementary activities, and in many workflows there won’t be much, if any, distinction between them. You might also see data assessment expressed as data quality assessment.
You can think of data assessment as the thing you do when comparing the results of data profiling to defined standards and identifying whether the data is ‘good’ or not based on the results of that comparison.
Data profiling vs data quality analysis
This is another tricky term. Regardless of the precise definition you choose for data quality analysis, its relationship to data profiling remains the same: data profiling is something you do in the course of a data quality analysis.
Data quality analysis is often used synonymously with data quality assessment, and there are many ways it can be defined. As with data quality assessment, data quality analysis is generally about comparing data to a standard and determining whether the standard is met.
If you want to draw a distinction between analysis and assessment, one way to do that is to assign them different scopes, with data assessment performed essentially at the data source level and data quality analysis performed on the organization’s data pipelines and operations as a whole.
Data profiling challenges
Even once you’ve decided what kind of data profiling you need, implementing it has its own challenges.
There are many points in a data pipeline where it can make sense to carry out some kind of data profiling, some of the most common being: when the data enters your systems, when data is consolidated in a lake or warehouse, and when data is pulled for queries or operational use.
It’s important to find a data profiling solution that you can apply consistently everywhere in your data architecture that you need it to be.
Nontechnical subject matter experts are one of your most valuable resources for determining the expected characteristics of the data and knowing whether changes in the data are acceptable and/or expected.
A data profiling solution needs to have a way for you to collaborate with these users so that they can remain continuously in the loop about what your profiling is doing and discovering.
Effective data profiling means you need to implement nuanced evaluations of a lot of data in a lot of different places.
The best data profiling solutions will provide you with a library of editable resources for creating your data profiling checks, to minimize the work you need to do from scratch.
Data profiling best practices
The idea of an already-fully-configured data profiling solution can be extremely tempting, particularly if that solution promises that anomaly detection is already built in and ready to go. But if that convenience comes in a black box, the effort it claimed to save up front quickly resurfaces downstream.
When everything is going well, opaqueness isn’t as much of an issue. But as soon as something goes wrong—for example, a false positive alert—it’s important to be able to identify exactly what happened and implement a way to stop it from happening again.
With a black box, that root cause analysis is difficult if not outright impossible. Transparency is essential if you want your data profiling and data quality to stay relevant and trusted on a long-term basis.
To whatever degree your data pipeline is automated, your data profiling needs to be at least as automated. If there’s significant lag between your data profile and the movement of your actual data, any knowledge you would have gained from profiling is out of date before you can even see it.
Automation is key to keeping your data profiling results and documentation current and therefore maximizing its value.
Avoiding institutional knowledge that isn’t written down is an important part of transparency in data profiling, but it deserves its own category because it can be so difficult to do effectively.
To be most effective, documentation needs to be readable by everyone who has a stake in the data: developers, analysts, data engineers and scientists, and nontechnical professionals. That means not only should the documentation not use an obscure language or syntax, it shouldn’t depend on any knowledge of programming at all.
Documentation is also only useful if it’s actually up-to-date. Data pipelines can be rapidly-evolving architectures, so data profiling documentation (and data quality documentation in general) needs to be able to keep pace.
Looking for a data profiling tool?
If you’re looking for a data profiling tool, check out Great Expectations.
GX is an open source data quality platform, which you can use to automate your data profiling using Data Assistants and gain a complete picture of your data’s key characteristics.
Using GX, you can draw on a deep library of pre-built Expectations: declarative expressions that give you flexible, extensible ways to make assertions about what you expect your data to be like.
Defining your data profiling using Expectations not only affords you complete transparency into your data profiling’s operations and alerting, but also constantly-refreshed documentation. Every test run in GX generates human-readable documentation suitable for use by nontechnical professionals, ensuring that your documentation never goes stale.
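To give a feel for the declarative style (this is not GX’s actual API, just a pure-Python sketch of the pattern): each check is a named assertion about the data, and its result carries enough detail to double as human-readable documentation.

```python
# NOT the Great Expectations API -- a minimal illustration of the declarative
# expectation pattern: a named assertion whose result documents itself.
def expect_column_values_to_not_be_null(rows, column):
    failures = sum(1 for r in rows if r.get(column) is None)
    return {
        "expectation": f"values in '{column}' should not be null",
        "success": failures == 0,
        "unexpected_count": failures,
    }

rows = [{"id": 1}, {"id": None}, {"id": 3}]
result = expect_column_values_to_not_be_null(rows, "id")
print(result["success"], result["unexpected_count"])  # False 1
```

Because the result records both the expectation and the outcome in plain language, a nontechnical reader can understand what was checked and what failed without reading any code.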
You can run GX by itself, or fold it into your existing data architecture through one of our integration partners.
See it for yourself by getting started with GX Open Source. Or join our Slack channel—we have a vibrant space for the fastest-growing data quality community in the world.