We have Great Expectations for Pandas Profiling
Given that you are reading this post on greatexpectations.io, we assume you’re a Python Data Person (TM). As a Python Data Person (TM), you are probably familiar with Pandas. And if you’re familiar with Pandas, you may already know Pandas Profiling, a fantastic open source library for, well, profiling your data set. We’ve collaborated with Simon Brugman, the core maintainer behind Pandas Profiling, to include a super handy “to Expectation Suite” method in the library, which turns your profiled report into a Great Expectations Expectation Suite that you can use to validate your data. If this all makes sense to you (or if you’ve been watching the original GitHub issue for a while) and you can’t wait to try it out, you can install the latest version of Pandas Profiling (v2.11.0 at the time of writing) and hop over to the examples in the Pandas Profiling repo straight away to get started. Otherwise, stick around and learn more about what exactly we’ve been up to!
What is Pandas Profiling?
Aww yeah, if you’re not familiar yet with Pandas Profiling, you’re in for a treat! It’s an open source Python library for profiling a Pandas dataframe. The README summarizes this pretty well:
The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
This means that with a simple one-liner (profile = ProfileReport(df)), you’ll get relevant information about your dataframe such as types, unique values, missing values, quantile statistics, descriptive statistics, most frequent values, etc. Pandas Profiling also renders these stats into a beautiful HTML report (see the screenshot below). Sweet!
How does Profiling relate to Great Expectations?
You may or may not have already used the built-in profiling capabilities that come with Great Expectations, specifically when running the suite scaffold command. This loads up a boilerplate notebook with our BasicSuiteBuilderProfiler, which automatically generates an Expectation Suite based on some lightweight profiling. For example, if our profiler finds that a column has no NULL values, it will create an expect_column_values_to_not_be_null Expectation. Or, if it determines that a column only contains strings from a specific value set, say “apple”, “pear”, “orange”, it will create a corresponding value set Expectation: expect_column_values_to_be_in_set(column="fruit", value_set=["apple", "pear", "orange"]). Got it?
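To make the idea concrete, here is a minimal sketch of how a profiler might turn observed properties of a column into Expectations. It is a toy written in plain Python, not the BasicSuiteBuilderProfiler itself: the function name scaffold_expectations and the max_set_size threshold are assumptions made up for illustration. Only the two Expectation names and the expectation_type/kwargs layout come from Great Expectations.

```python
from typing import Any, Dict, List, Optional


def scaffold_expectations(
    column: str, values: List[Optional[Any]], max_set_size: int = 5
) -> List[Dict[str, Any]]:
    """Toy profiler (hypothetical): derive expectations from one column's values."""
    expectations: List[Dict[str, Any]] = []

    # No missing values observed -> assert that future batches have none either.
    if all(value is not None for value in values):
        expectations.append(
            {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": column},
            }
        )

    # Only a handful of distinct strings observed -> treat them as the allowed set.
    observed = [value for value in values if value is not None]
    all_strings = all(isinstance(value, str) for value in observed)
    distinct = sorted(set(observed)) if all_strings else []
    if 0 < len(distinct) <= max_set_size:
        expectations.append(
            {
                "expectation_type": "expect_column_values_to_be_in_set",
                "kwargs": {"column": column, "value_set": distinct},
            }
        )

    return expectations


fruit_expectations = scaffold_expectations(
    "fruit", ["apple", "pear", "orange", "apple"]
)
for expectation in fruit_expectations:
    print(expectation)
```

A real profiler looks at far more than this (types, distributions, uniqueness), but the pattern is the same: observe a property, then emit the Expectation that asserts it.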
One big advantage of using an automated profiler to “scaffold” your suite is that you don’t have to write every single Expectation by hand. The other advantage is that the profiler can highlight properties of your data that you’re not even aware of: it makes some implicit knowledge explicit, and allows you to assert this in future data batches. Automated suite generation (or scaffolding, since we believe having a human check the generated suite and make tweaks is always beneficial) simply takes some of the work off your plate. Neat!
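As a rough illustration of what “asserting this in future data batches” means, the sketch below checks a new batch of rows against two scaffolded Expectations. This is a hand-rolled stand-in, not the Great Expectations validation engine: validate_batch and the shape of its result dictionaries are hypothetical, and the choice to skip missing values in the value set check is an assumption of this sketch.

```python
from typing import Any, Dict, List


def validate_batch(
    rows: List[Dict[str, Any]], expectations: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Toy validator (hypothetical): check each expectation against a batch of rows."""
    results = []
    for expectation in expectations:
        kind = expectation["expectation_type"]
        kwargs = expectation["kwargs"]
        column_values = [row.get(kwargs["column"]) for row in rows]

        if kind == "expect_column_values_to_not_be_null":
            unexpected = [value for value in column_values if value is None]
        elif kind == "expect_column_values_to_be_in_set":
            allowed = set(kwargs["value_set"])
            # Assumption of this sketch: missing values are left to the
            # not-null check and are skipped here.
            unexpected = [
                value
                for value in column_values
                if value is not None and value not in allowed
            ]
        else:
            raise ValueError(f"Unsupported expectation type: {kind}")

        results.append(
            {
                "expectation_type": kind,
                "success": not unexpected,
                "unexpected_list": unexpected,
            }
        )
    return results


# Expectations scaffolded from an earlier batch of data.
scaffolded = [
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "fruit"},
    },
    {
        "expectation_type": "expect_column_values_to_be_in_set",
        "kwargs": {"column": "fruit", "value_set": ["apple", "pear", "orange"]},
    },
]

# A later batch that violates both assumptions.
new_batch = [{"fruit": "apple"}, {"fruit": "banana"}, {"fruit": None}]
results = validate_batch(new_batch, scaffolded)
for result in results:
    print(result)
```

The knowledge that was implicit in the first batch (“fruit is never missing and is always one of three values”) is now an explicit check that flags the banana and the missing value in the second batch.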
Our profiler is still in the early stages of development (you might call it “experimental”). While we’re actively working on making it a lot smarter, we figured it would be awesome to integrate with an already mature profiling library, and this is where Pandas Profiling comes into play.
How does Pandas Profiling integrate with Great Expectations?
This is where it gets exciting! We implemented a simple method on the Pandas Profiling ProfileReport object that allows you to generate an Expectation Suite, as well as run validation and build Data Docs if desired. Here’s a little snippet that demonstrates this:
    import pandas as pd
    from pandas_profiling import ProfileReport

    # Load your dataframe
    df = pd.read_csv('yellow_tripdata_sample_2019-01.csv')

    # Then run Pandas Profiling
    profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

    # And obtain an Expectation Suite from the profile report
    suite = profile.to_expectation_suite(suite_name="my_pandas_profiling_suite")
Boom, there you have it: suite is now a Great Expectations ExpectationSuite object, which you can use directly in your code to validate another batch of data, or store in your Data Context. See the examples in the Pandas Profiling repo for complete working examples and configuration options!
The integration also allows you to make use of semantic types via visions, the type library that ships with Pandas Profiling. With visions, you can provide more fine-grained typing information to the profiler that abstracts from the underlying storage type of your data. For example, if your data contains a “user ID” column that’s stored as an integer, calculating the min, max, and mean of this column would not make a lot of sense. With visions, you can specify that the identifier column should simply be treated as a string, which in turn means fewer unhelpful numeric Expectations are generated.
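The effect of a semantic type on the generated suite can be sketched in a few lines of plain Python. This does not use the visions API: the mapping EXPECTATIONS_BY_TYPE, the helper names, and the two type labels are all hypothetical, chosen only to show why overriding the storage type of a “user ID” column removes the numeric Expectations. The Expectation names themselves are real Great Expectations names.

```python
from typing import Any, Dict, List, Optional

# Hypothetical mapping from a column's type to the expectation types
# a profiler might consider worth generating for it.
EXPECTATIONS_BY_TYPE: Dict[str, List[str]] = {
    "numeric": [
        "expect_column_values_to_not_be_null",
        "expect_column_min_to_be_between",
        "expect_column_max_to_be_between",
        "expect_column_mean_to_be_between",
    ],
    "string": [
        "expect_column_values_to_not_be_null",
        "expect_column_value_lengths_to_be_between",
    ],
}


def infer_storage_type(values: List[Any]) -> str:
    """Guess a type purely from how the values are stored."""
    is_numeric = all(isinstance(value, (int, float)) for value in values)
    return "numeric" if is_numeric else "string"


def plan_expectations(
    columns: Dict[str, List[Any]],
    semantic_overrides: Optional[Dict[str, str]] = None,
) -> Dict[str, List[str]]:
    """Pick expectation types per column, letting semantic types win over storage types."""
    overrides = semantic_overrides or {}
    plan = {}
    for name, values in columns.items():
        column_type = overrides.get(name, infer_storage_type(values))
        plan[name] = list(EXPECTATIONS_BY_TYPE[column_type])
    return plan


columns = {
    "user_id": [1001, 1002, 1003],
    "fare_amount": [7.5, 12.0, 30.25],
}

# Without extra information, user_id looks numeric and gets min/max/mean checks.
default_plan = plan_expectations(columns)

# Declaring user_id as a string drops the meaningless numeric checks.
typed_plan = plan_expectations(columns, semantic_overrides={"user_id": "string"})

print(default_plan["user_id"])
print(typed_plan["user_id"])
```

The fare_amount column is untouched by the override, so it keeps its numeric Expectations in both plans.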
And that’s it! We’re really excited about this super simple integration that we hope will be helpful in quickly generating smart Expectation Suites. If you would like to provide feedback on the integration, hop over to our Great Expectations Slack channel and say hi, or get involved as a Pandas Profiling contributor!