Building a Self-Updating Data Dictionary with Great Expectations

October 04, 2019

Recently, we’ve been thinking a lot about data dictionaries at Superconductive Health. Our clients have a broad range of technical expertise, and some ask us for quick summaries of their data in data dictionary form - e.g. a summary of the metadata about each column in a table. With that frequent request in mind, I really appreciated Alex Jia and Michael Kaminsky’s recent post, ”You probably don’t need a data dictionary.” They made a salient point: the maintenance burden makes data dictionaries hard to scale, so if a data dictionary needs to be manually maintained, it’s probably not worth the effort, because you end up maintaining multiple sources of truth for your data instead of relying on one. But what if a data dictionary were updated automatically from your single source of truth?

This seemed like a great candidate for a Great Expectations plugin. We’re already using GE as our lone source of truth for data documentation, and with the amount of information already captured in GE, we wouldn’t need to capture anything new - we just need to provide a new view into the information we already have. At the same time, a data dictionary might not be critical functionality for every GE user, so rather than building it into the core project, a plugin seemed like the ideal way to proceed. Plugins are a simple way to add modular functionality to GE without altering the core. Running great_expectations init to initialize a GE project automatically creates a plugins folder where you can drop bits of code to add or adjust functionality in GE. We can then adjust the project settings in our great_expectations.yml file (also created upon project initialization) to tell GE to use our plugin instead of the base code.

plugin_dirtree

Since our goal for building a data dictionary is to provide a new view into data that we already have, this falls under GE’s rendering functionality - we don’t need to process new information, just repackage what we already have in a new way. For this example project, we’re using the Overall Performance dataset, which contains Merit-Based Incentive Payment System (MIPS) final scores and performance category scores for clinicians. You can export it as a CSV from the top of this page. As a reminder, this is what our basic expectation renderer looks like:

basic_renderer

Our goal is to add a “Data Dictionary” section above the “Expectation Suite Overview” that provides a table with some metadata about the columns of this CSV, like data type, nullability, possible values, etc. These columns will populate automatically based on our expectations, so as our data changes, the data dictionary will change without requiring manual maintenance. Let me say this one more time: your data dictionary will update automatically whenever your data documentation is updated - no manual intervention necessary. That solves the primary problem that Michael and Alex identified with data dictionaries. Before we dive into the weeds, let’s start by building a very simple plugin that adds a “Hello World” section to our already rendered documentation, so we can see what the plugin process is like. First, in our great_expectations.yml file, under site_section_builders, we’ll change out the module_name and class_name for the expectations renderer. For now, we’ve just commented out the previous renderer so we can easily switch between them.

data_context_config
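For reference, the edit looks roughly like this. Treat it as an illustrative sketch: the exact keys surrounding site_section_builders in great_expectations.yml vary by GE version, so check your own file's layout.

```yaml
site_section_builders:
  expectations:
    # Previous renderer, commented out so we can easily switch back:
    # module_name: great_expectations.render.renderer
    # class_name: ExpectationSuitePageRenderer
    module_name: my_custom_renderer
    class_name: MyCustomExpectationSuitePageRenderer
```

The module_name must match the filename we create in the plugins folder, and the class_name must match the class defined inside it.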

Next, we’ll need to actually add the plugin. In our plugins folder, we’ll create a file called my_custom_renderer.py (to match the module_name from our great_expectations.yml). At the top, we’ll import the necessary modules from core GE.

from great_expectations.render.renderer import (
    ExpectationSuitePageRenderer 
)
from great_expectations.render.types import (
    RenderedDocumentContent,
    RenderedSectionContent,
    RenderedComponentContent,
)

Then we’ll create a class that inherits from the pre-existing ExpectationSuitePageRenderer and calls the parent’s render function via super(). The goal here is to prepend our new “Hello World” section to the existing rendering output.

class MyCustomExpectationSuitePageRenderer(ExpectationSuitePageRenderer):
    def render(self, expectations):
        rendered_document_content = super().render(expectations)

        content_blocks = [
            self._render_data_dictionary_header(expectations),
        ]

        rendered_document_content["sections"] = [
            RenderedSectionContent(**{
                "section_name": "Hello World!!",
                "content_blocks": content_blocks
            })
        ] + rendered_document_content["sections"]

        return rendered_document_content

    def _render_data_dictionary_header(self, expectations):
        return RenderedComponentContent(**{
            "content_block_type": "header",
            "header": "hello world",
            "styling": {
                "classes": ["col-12"],
                "header": {
                    "classes": ["alert", "alert-secondary"]
                }
            }
        })

Now, when we go to the command line and run great_expectations build-documentation, we’ll rebuild our documentation with the new renderer. When we edited our great_expectations.yml file, we changed the renderer for the local site, so that’s the one we’ll want to open up. And when we do, lo and behold! A shiny, new “hello world” section prepended to our existing documentation.

hello_world

Being able to change the renderer’s output is pretty exciting, but in its current form our plugin isn’t very useful. Let’s change that by configuring it to display a table with our columns and any column descriptions. To do this, we’ll go into our BasicDatasetProfiler.json file. When we run great_expectations init, GE profiles our datasource to provide us with a template for specifying our own expectations. These profiles are saved as .json files. Initially they’ll be pretty generic, but as we fill them in with our expectations, they’ll act as the source of truth from which our data dictionary pulls. Additionally, GE does a pretty good job of extracting data types, so we get those as a freebie. Let’s open up great_expectations/expectations/mips__dir/default/physician_mips/BasicDatasetProfiler.json and take a look.

renderer_json

We can see that our BasicDatasetProfiler.json is divided into two main sections. At the beginning there’s a "meta" section, which contains some metadata for our datasource, like column names and column descriptions (your descriptions might be empty), as well as a broader "notes" section. This is followed by the "expectations" section, which makes up the bulk of the .json file and contains all of the expectations themselves. We’ll dive deeper into this part later in the post.
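To make that structure concrete, a heavily trimmed BasicDatasetProfiler.json might look something like the sketch below. The column names, descriptions, and values here are illustrative assumptions, not copied from the real file:

```json
{
  "meta": {
    "columns": {
      "final_score": {"description": "Clinician's overall MIPS final score"},
      "quality_category_score": {"description": ""}
    },
    "notes": "Profiled by BasicDatasetProfiler"
  },
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_be_in_type_list",
      "kwargs": {
        "column": "final_score",
        "type_list": ["FLOAT", "float", "DoubleType"]
      }
    }
  ]
}
```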

For now, let’s try extracting the column names and column descriptions - these will make up the first two columns of our data dictionary. If we look back at the code for the “hello world” renderer that we built, we have most of the building blocks of our renderer in our content_blocks list. So far, all we have is a header - let’s try adding the empty table that will be our data dictionary. To do that, we’ll add a new _render_data_dictionary function. This will return a new RenderedComponentContent object that will get added to the content_blocks list in our render function.

def render(self, expectations):
    rendered_document_content = super().render(expectations)

    content_blocks = [
        self._render_data_dictionary_header(expectations),
        self._render_data_dictionary(expectations),
    ]

    rendered_document_content["sections"] = [
        RenderedSectionContent(**{
            "section_name": "Data Dictionary",
            "content_blocks": content_blocks
        })
    ] + rendered_document_content["sections"]

    return rendered_document_content

def _render_data_dictionary(self, expectations):
    return RenderedComponentContent(**{
        "content_block_type": "table",
        "header_row": ["Column Name", "Description"],
        "header": "Data Dictionary",
        "table": [],
        "styling": {
            "classes": ["col-12", "table-responsive"],
            "styles": {
                "margin-top": "20px",
                "margin-bottom": "20px"
            },
            "body": {
                "classes": ["table", "table-sm"]
            }
        },
    })

For now, we’ll leave "table" as an empty list, and we’ll specify our first two columns in the "header_row" (“Column Name” and “Description”). Now when we rebuild our documentation (by running great_expectations build-documentation on the command line), you’ll see the start of our new table, with its header and header row.

headers_only

Next, we’ll replace the empty list in "table" with values from a pandas DataFrame. We can add each column to the DataFrame as we parse it from the .json, gradually populating our data dictionary. Let’s first build a function to get the table columns and descriptions. Thanks to the meta section in our .json file, this is a simple, one-line function.

def _get_table_columns(self, expectations):
    return expectations.get("meta").get("columns")

Next, we’ll create the pandas DataFrame, and add our columns for Column Name and Description. Make sure that you’re importing pandas at the top of the module. Then we’ll use a list comprehension in conjunction with our _get_table_columns function to populate each of these columns. Also, let’s change out the text in our header from “hello world” to “Data Dictionary.” Lastly, be sure to replace your empty list in table with the values of your new DataFrame!

def _render_data_dictionary(self, expectations):
    data_dictionary_df = pd.DataFrame()
    data_dictionary_df["Column Name"] = [i for i in self._get_table_columns(expectations)]
    data_dictionary_df["Description"] = [i["description"] for i in self._get_table_columns(expectations).values()]

    return RenderedComponentContent(**{
        "content_block_type": "table",
        "header_row": ["Column Name", "Description"],
        "header": "Data Dictionary",
        "table": data_dictionary_df.values,
        "styling": {
            "classes": ["col-12", "table-responsive"],
            "styles": {
                "margin-top": "20px",
                "margin-bottom": "20px"
            },
            "body": {
                "classes": ["table", "table-sm"]
            }
        },
    })

When we rebuild our documentation, we can see our first two columns! At this point, you may have noticed that your “Description” column is empty. That’s because the descriptions for each column in the “meta” section of our BasicDatasetProfiler.json are empty. And that is the idea here - as long as we keep our expectations up-to-date in the form of these .json files, we can count on our data dictionary to be up-to-date. So if we add some descriptions in the .json file, we’ll start to fill in our data dictionary.

basic_dictionary

Now we can start parsing our column level expectations in order to populate some more columns in our data dictionary. We’re going to be reaching into the expectations part of our .json file in order to do this. If you’ve perused the .json, you might have noticed that all of our expectations are at the same level - they are not organized under individual column sections. So first things first, we’ll define a function that organizes our expectations by column. This will take in our expectations and use a filter to populate a dictionary where the keys are the names of our columns, and the values are a list of our expectations.

def _sort_expectations_by_column(self, expectations):
    expectations_by_column = {}
    expectations_dictionary = expectations.get("expectations")
    column_names = list(self._get_table_columns(expectations).keys())
    for column in column_names:
        expectations_by_column[column] = list(filter(
            lambda x: x.get("kwargs").get("column") == column,
            expectations_dictionary))
    return expectations_by_column
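To make the grouping concrete, here’s the same logic as a standalone function, run on a toy expectations dict. The column name and expectations below are made up for illustration, not taken from the MIPS dataset:

```python
def sort_expectations_by_column(expectations):
    """Standalone version of the grouping above: collect each column's
    expectations from the flat "expectations" list."""
    expectations_list = expectations.get("expectations")
    column_names = list(expectations.get("meta").get("columns").keys())
    return {
        column: [e for e in expectations_list
                 if e.get("kwargs", {}).get("column") == column]
        for column in column_names
    }

# Toy input: one column-scoped expectation and one table-level expectation
toy = {
    "meta": {"columns": {"score": {"description": ""}}},
    "expectations": [
        {"expectation_type": "expect_column_to_exist",
         "kwargs": {"column": "score"}},
        {"expectation_type": "expect_table_row_count_to_be_between",
         "kwargs": {"min_value": 1, "max_value": None}},
    ],
}

grouped = sort_expectations_by_column(toy)
# Only the column-scoped expectation lands under "score"; the table-level
# expectation has no "column" kwarg, so the filter drops it.
```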

What column should we add next? Data types for each column would be useful, and GE does a great job of parsing our data source during profiling in order to guess the data type of each column. It saves this in the expect_column_values_to_be_in_type_list expectation, which every column fresh from profiling should have. Here we’ll create another function for our data types column.

def _get_column_data_types(self, expectations_by_column):
    column_data_type_expectations = {}
    for k, v in expectations_by_column.items():
        expectation = [i for i in v if i["expectation_type"] == "expect_column_values_to_be_in_type_list"]
        if len(expectation) > 0:
            type_list = expectation[0].get("kwargs").get("type_list")
            if len(type_list) > 0:
                column_data_type_expectations[k] = type_list[0]
            else:
                column_data_type_expectations[k] = type_list
        else:
            column_data_type_expectations[k] = None
    return column_data_type_expectations

This will take in our sorted expectations by column, loop through each column, and if it has an expect_column_values_to_be_in_type_list expectation, it will add the type_list from that expectation to a returned dictionary. We have some very basic error handling here, in case we don’t have an expectation for expect_column_values_to_be_in_type_list, or in case the type_list in the expectation is empty. Also, the type_list is set up to contain a list of several different versions of a given type based on the datasource (‘TEXT’, ‘string’, ‘str’, ‘VARCHAR’, etc.), so for this simple example, we’ll only pull the first item in this list. If we wanted, we could add a lookup function to return a different version of the data type for a different data source.
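If we did want that, the lookup could be as simple as a dictionary mapping the various backend spellings to one display name. Everything below (the mapping entries and the helper name) is an illustrative assumption, not part of GE:

```python
# Illustrative mapping from backend-specific type names to one display name.
# These entries are assumptions for the sketch, not an exhaustive list.
TYPE_DISPLAY_LOOKUP = {
    "TEXT": "string", "VARCHAR": "string", "str": "string", "string": "string",
    "INTEGER": "integer", "int": "integer",
    "FLOAT": "float", "float": "float", "DOUBLE_PRECISION": "float",
}

def display_type(type_list):
    """Return a friendly display name for the first recognized entry,
    falling back to the first raw entry (or None for an empty list)."""
    for type_name in type_list:
        if type_name in TYPE_DISPLAY_LOOKUP:
            return TYPE_DISPLAY_LOOKUP[type_name]
    return type_list[0] if type_list else None
```

This could slot into _get_column_data_types in place of taking type_list[0] directly.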

Using our _get_column_data_types function, we can add the next column to our data dictionary. In our _render_data_dictionary function, we’ll need to add a new column to our data_dictionary_df, and then a new column name to the "header_row" list in the RenderedComponentContent object. I also know that I’ll be adding more columns from our expectations, so I’ll save our sorted expectations by column to a variable.

def _render_data_dictionary(self, expectations):
    data_dictionary_df = pd.DataFrame()
    data_dictionary_df["Column Name"] = [i for i in self._get_table_columns(expectations)]
    data_dictionary_df["Description"] = [i["description"] for i in self._get_table_columns(expectations).values()]

    expectations_by_column = self._sort_expectations_by_column(expectations)
    data_dictionary_df["Data Type"] = [i for i in self._get_column_data_types(expectations_by_column).values()]

    return RenderedComponentContent(**{
        "content_block_type": "table",
        "header_row": ["Column Name", "Description", "Data Type"],
        "header": "Data Dictionary",
        "table": data_dictionary_df.values,
        "styling": {
            "classes": ["col-12", "table-responsive"],
            "styles": {
                "margin-top": "20px",
                "margin-bottom": "20px"
            },
            "body": {
                "classes": ["table", "table-sm"]
            }
        },
    })

When we rebuild our documentation, we can see our new data type column!

final_dictionary

See the gist for all of the code we’ve written so far.

There are many more columns we can add just from GE’s basic profiling, like nullability, allowed value sets, and mins and maxes. Most of these will hold the default values from GE profiling if you haven’t edited your expectations (for instance, nullability will come back with an acceptable level of 50% by default), but this is where a dynamic data dictionary shines: as you update your expectation suite, you can count on the data dictionary to stay current. If you add new expectations, like those for string lengths or regexes, you can add new columns to show those new attributes. And if you decide that you no longer want this data dictionary in your GE documentation, you can remove the plugin, or write a new one!
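As one example of an extra column, a nullability helper could follow the same pattern as _get_column_data_types, reading expect_column_values_to_not_be_null and its mostly kwarg. Written here as a standalone function for clarity; the labels it returns are my own choice, not GE conventions:

```python
def get_column_nullability(expectations_by_column):
    """For each column, report whether a not-null expectation exists,
    and at what "mostly" threshold if one is set."""
    nullability = {}
    for column, column_expectations in expectations_by_column.items():
        not_null = [e for e in column_expectations
                    if e["expectation_type"] == "expect_column_values_to_not_be_null"]
        if not not_null:
            # No not-null expectation: treat the column as nullable
            nullability[column] = "nullable"
        else:
            mostly = not_null[0].get("kwargs", {}).get("mostly")
            nullability[column] = ("not null" if mostly is None
                                   else ">= {:.0%} not null".format(mostly))
    return nullability
```

The result dict can be added to data_dictionary_df as a "Nullability" column, just like the data types were.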

The final code for the plugin I built is here for reference, but note that this post doesn’t cover everything in it!

Written by The Great Expectations Team