Building a Self-Updating Data Dictionary with Great Expectations (UPDATED)

Extending Great Expectations data documentation makes it possible to build customized sites that are dynamically updated based on new data. Here, we build a self-updating data dictionary.

July 31, 2020

This post has been updated and works as of great_expectations 0.11.8. It was originally posted October 04, 2019.

Recently, we’ve been thinking a lot about data dictionaries at Superconductive Health. Our clients have a broad range of technical expertise, and some ask us for quick summaries of their data in data dictionary form, e.g. a summary of the metadata about each column in a table. With that frequent request in mind, I really appreciated Alex Jia and Michael Kaminsky’s recent post, “You probably don’t need a data dictionary.” They made some salient points about the maintenance burden becoming a scalability issue for data dictionaries. Their logical conclusion: if a data dictionary needs to be manually maintained, it’s probably not worth the effort, because you end up maintaining multiple sources of truth for your data instead of relying on one. But what if a data dictionary were updated automatically based on your single source of truth?

This seemed like a great candidate for a Great Expectations plugin. We’re already using GE as our single source of truth for data documentation, and with the amount of information already captured in GE, we wouldn’t need to capture any new information - rather, we just need to provide a new view into the information that we already have. At the same time, this data dictionary feature might not provide critical functionality for every GE user, so rather than building it into the core project, building it out as a plugin seemed like the ideal way to proceed. Plugins are a simple way to add modular functionality to GE without altering the core. Running great_expectations init to initialize a GE project automatically creates a plugins folder where you can drop bits of code to add or adjust functionality in GE. We can then adjust the project settings in our great_expectations.yml file (also created upon project initialization) to tell GE to use our plugin instead of the base code.

plugin_dirtree

Since our goal for building a data dictionary is to provide a new view into data that we already have, this falls into GE’s rendering functionality - we don’t need to process new information, but just to repackage the information we already have in a new way. For this example project, we’re using the Overall Performance dataset, which contains Merit-Based Incentive Payment System (MIPS) final scores and performance category scores for clinicians. You can export it as a CSV from the top of this page. As a reminder, this is what our basic expectation renderer looks like:

basic_renderer

Our goal is to add a “Data Dictionary” section above the “Expectation Suite Overview” that will provide a table with some metadata about the columns of this CSV, like data type, nullability, possible values, etc. These columns will automatically populate based on our expectations, so as our data changes, the data dictionary will change without requiring manual maintenance. Let me say this one more time: Your data dictionaries will update automatically when your data documentation gets updated automatically - no manual intervention necessary. That solves the primary problem that Michael and Alex had identified with data dictionaries. Before we dive into the weeds, let’s start by building a very simple plugin that adds a “Hello World” section to our already rendered documentation so that we can see what the plugin process is like. First, in our great_expectations.yml file, under data_docs_sites then local_site, we will create a new key called site_section_builders. Under this, we’ll add a key called expectations; below that, a key called renderer; and below that, we will add our module_name and class_name.

local_site:
  class_name: SiteBuilder
  # set to false to hide how-to buttons in Data Docs
  show_how_to_buttons: true
  store_backend:
    class_name: TupleFilesystemStoreBackend
    base_directory: uncommitted/data_docs/local_site/

  site_section_builders:
    expectations:
      renderer:
        module_name: custom_data_docs.my_custom_renderer
        class_name: MyCustomExpectationSuitePageRenderer

Next, we’ll need to actually add the plugin. In the plugins/custom_data_docs/ directory, we’ll create a file called my_custom_renderer.py (to match the module_name from our great_expectations.yml). At the top, we’ll import the necessary modules from core GE.

from great_expectations.render.renderer import (
    ExpectationSuitePageRenderer
)
from great_expectations.render.types import (
    RenderedDocumentContent,
    RenderedSectionContent,
    RenderedHeaderContent,
    RenderedTableContent,
)

Then we’ll create a class that inherits from the pre-existing ExpectationSuitePageRenderer and override its render function - we’ll start by pasting in the original function. Then we’ll add a call to _render_data_dictionary_header at the start of overview_content_blocks. Finally, we’ll create the _render_data_dictionary_header function below. The goal here is to prepend our new “Hello World” section to our existing rendering functionality.

import pandas as pd

from great_expectations.render.renderer import (
    ExpectationSuitePageRenderer
)
from great_expectations.render.types import (
    RenderedDocumentContent,
    RenderedSectionContent,
    RenderedHeaderContent,
    RenderedTableContent,
)

class MyCustomExpectationSuitePageRenderer(ExpectationSuitePageRenderer):
    def render(self, expectations):
        columns, ordered_columns = self._group_and_order_expectations_by_column(
            expectations
        )
        expectation_suite_name = expectations.expectation_suite_name

        overview_content_blocks = [
            self._render_data_dictionary_header(expectations),
            self._render_expectation_suite_header(),
            self._render_expectation_suite_info(expectations),
        ]

        table_level_expectations_content_block = self._render_table_level_expectations(
            columns
        )
        if table_level_expectations_content_block is not None:
            overview_content_blocks.append(table_level_expectations_content_block)

        asset_notes_content_block = self._render_expectation_suite_notes(expectations)
        if asset_notes_content_block is not None:
            overview_content_blocks.append(asset_notes_content_block)

        sections = [
            RenderedSectionContent(
                **{
                    "section_name": "Overview",
                    "content_blocks": overview_content_blocks,
                }
            )
        ]

        sections += [
            self._column_section_renderer.render(expectations=columns[column])
            for column in ordered_columns
            if column != "_nocolumn"
        ]
        return RenderedDocumentContent(
            **{
                "renderer_type": "ExpectationSuitePageRenderer",
                "page_title": expectation_suite_name,
                "expectation_suite_name": expectation_suite_name,
                "utm_medium": "expectation-suite-page",
                "sections": sections,
            }
        )

    def _render_data_dictionary_header(self, expectations):
        return RenderedHeaderContent(**{
            "content_block_type": "header",
            "header": "hello world",
            "styling": {
                "classes": ["col-12"],
                "header": {
                    "classes": ["alert", "alert-secondary"]
                }
            }
        })

Now, when we go to the command line and run great_expectations docs build, we will rebuild our documentation with the new renderer. When we edited our great_expectations.yml file, we changed the renderer for the local site, so we’ll want to open that one up. When opening data docs, we will start on the index page, so we will click into an expectation suite (as we are editing an expectation suite renderer). And when we do, lo and behold! A shiny, new “hello world” section prepends our existing renderer.

hello_world

Being able to change the renderer’s output is pretty exciting. But in its current format, our plugin isn’t very useful. Let’s change that by configuring it to display a table with our columns and any column descriptions. In order to do this, we’ll go into our BasicDatasetProfiler.json file. When we run great_expectations init, GE profiles our datasource in order to provide us with a template for specifying our own expectations. These profiles are saved as .json files. Initially, these will be pretty generic, but as we fill in these profiles with our expectations, they will act as the source of truth from which our data dictionary will pull. Additionally, GE does a pretty good job of extracting data types, so we get that as a freebie. Let’s open up great_expectations/expectations/mips__dir/default/physician_mips/BasicDatasetProfiler.json and take a look.

renderer_json

We can see that our BasicDatasetProfiler.json is divided into two main sections. There’s a "meta" section which contains some metadata for our datasource like column names and column descriptions (your descriptions might be empty), as well as a broader "notes" section. Then there is our "expectations" section which is the bulk of this .json file. It contains all of the expectations themselves. We’ll dive deeper into this part later on in this post.
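
To make that layout concrete, here is a minimal sketch that parses a trimmed-down, hand-written stand-in for the suite file. The suite name, the column names (npi, final_mips_score), and the note text are placeholders made up for illustration, not the contents of the real profile; only the meta / columns / description and expectations / expectation_type / kwargs keys mirror what is described above.

```python
import json

# Hypothetical, trimmed-down stand-in for BasicDatasetProfiler.json.
# Column names and values are placeholders, not the real MIPS profile.
SUITE_JSON = """
{
  "expectation_suite_name": "example_suite",
  "meta": {
    "columns": {
      "npi": {"description": ""},
      "final_mips_score": {"description": ""}
    },
    "notes": {"format": "markdown", "content": ["placeholder note"]}
  },
  "expectations": [
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "npi"}},
    {"expectation_type": "expect_column_values_to_be_of_type",
     "kwargs": {"column": "final_mips_score", "type_": "float"}}
  ]
}
"""

suite = json.loads(SUITE_JSON)

# The "meta" section maps each column name to its metadata.
column_names = list(suite["meta"]["columns"])

# The "expectations" section is one flat list; each entry names its column in kwargs.
expectation_types = [e["expectation_type"] for e in suite["expectations"]]

print(column_names)
print(expectation_types)
```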

For now, let’s try extracting the column names and column descriptions - these will make up the first two columns of our data dictionary. If we look back at the code for the “hello world” renderer that we built, we have most of the building blocks of our renderer in our content_blocks list. So far, all we have is a header - let’s try adding the empty table that will be our data dictionary. To do that, we’ll add a new _render_data_dictionary function. This will return a new RenderedTableContent object that will get added to the content_blocks list in our render function.

    def render(self, expectations):
        columns, ordered_columns = self._group_and_order_expectations_by_column(
            expectations
        )
        expectation_suite_name = expectations.expectation_suite_name

        overview_content_blocks = [
            self._render_data_dictionary_header(expectations),
            self._render_data_dictionary(expectations),
            self._render_expectation_suite_header(),
            self._render_expectation_suite_info(expectations),
        ]

        table_level_expectations_content_block = self._render_table_level_expectations(
            columns
        )
        if table_level_expectations_content_block is not None:
            overview_content_blocks.append(table_level_expectations_content_block)

        asset_notes_content_block = self._render_expectation_suite_notes(expectations)
        if asset_notes_content_block is not None:
            overview_content_blocks.append(asset_notes_content_block)

        sections = [
            RenderedSectionContent(
                **{
                    "section_name": "Overview",
                    "content_blocks": overview_content_blocks,
                }
            )
        ]

        sections += [
            self._column_section_renderer.render(expectations=columns[column])
            for column in ordered_columns
            if column != "_nocolumn"
        ]
        return RenderedDocumentContent(
            **{
                "renderer_type": "ExpectationSuitePageRenderer",
                "page_title": expectation_suite_name,
                "expectation_suite_name": expectation_suite_name,
                "utm_medium": "expectation-suite-page",
                "sections": sections,
            }
        )

    def _render_data_dictionary(self, expectations):
        return RenderedTableContent(**{
            "content_block_type": "table",
            "header_row": ["Column Name", "Description"],
            "header": "Data Dictionary",
            "table": [],
            "styling": {
                "classes": ["col-12", "table-responsive"],
                "styles": {
                    "margin-top": "20px",
                    "margin-bottom": "20px"
                },
                "body": {
                    "classes": ["table", "table-sm"]
                }
            },
        })

For now, we’ll leave the "table" an empty list, and we’ll specify our first two columns in the "header_row" (“Column Name” and “Description”). Now when we rebuild our documentation (by running great_expectations docs build on the command line), you’ll see the start of our new table with header, and header row.

headers_only

Now what we’ll do is replace the empty list that is our table with a pandas DataFrame. Then we can add each column to the dataframe as we parse it from the .json, and slowly populate our data dictionary. Let’s first build a function to get the table columns and descriptions. Because of the meta section in our .json file, this is a simple, one-line function.

def _get_table_columns(self, expectations):
    return expectations.meta.get("columns")

Next, we’ll create the pandas DataFrame, and add our columns for Column Name and Description. Make sure that you’re importing pandas at the top of the module. Then we’ll use a list comprehension in conjunction with our _get_table_columns function to populate each of these columns. Also, let’s change out the text in our header from “hello world” to “Data Dictionary.” Lastly, be sure to replace your empty list in table with the values of your new DataFrame!

def _render_data_dictionary(self, expectations):
  data_dictionary_df = pd.DataFrame()
  if self._get_table_columns(expectations):
    data_dictionary_df["Column Name"] = [i for i in self._get_table_columns(expectations)]
    data_dictionary_df["Description"] = [i["description"] for i in self._get_table_columns(expectations).values()]
  return RenderedTableContent(**{
      "content_block_type": "table",
      "header_row": ["Column Name", "Description"],
      "header": "Data Dictionary",
      "table": data_dictionary_df.values,
      "styling": {
          "classes": ["col-12", "table-responsive"],
          "styles": {
              "margin-top": "20px",
              "margin-bottom": "20px"
          },
          "body": {
              "classes": ["table", "table-sm"]
          }
      },
  })

When we rebuild our documentation, we can see our first two columns! At this point, you may have noticed that your “Description” column is empty. That’s because the descriptions for each column in the “meta” section of our BasicDatasetProfiler.json are empty. And that is the idea here - as long as we keep our expectations up-to-date in the form of these .json files, we can count on our data dictionary to be up-to-date. So if we add some descriptions in the .json file, we’ll start to fill in our data dictionary.
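
If you would rather not hand-edit the file, descriptions could also be filled in with a few lines of Python. This is only a sketch on an in-memory dictionary shaped like the “meta” section; the add_descriptions helper, the column names, and the description text are all hypothetical, and reading or writing the actual .json file is left out.

```python
import copy


def add_descriptions(suite, descriptions):
    """Return a copy of a suite dict with column descriptions filled in.

    Hypothetical helper: columns missing from `descriptions` are left untouched,
    and names that are not in the suite's meta section are ignored.
    """
    updated = copy.deepcopy(suite)
    columns = updated.get("meta", {}).get("columns", {})
    for name, text in descriptions.items():
        if name in columns:
            columns[name]["description"] = text
    return updated


# Placeholder suite with empty descriptions, as a fresh profile would have.
suite = {
    "meta": {
        "columns": {
            "npi": {"description": ""},
            "final_mips_score": {"description": ""},
        }
    },
    "expectations": [],
}

updated = add_descriptions(suite, {"npi": "National Provider Identifier"})
print(updated["meta"]["columns"])
```

Working on a copy keeps the original dictionary unchanged, which makes it easy to compare the before and after versions before saving anything back.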

basic_dictionary

Now we can start parsing our column level expectations in order to populate some more columns in our data dictionary. We’re going to be reaching into the expectations part of our .json file in order to do this. If you’ve perused the .json, you might have noticed that all of our expectations are at the same level - they are not organized under individual column sections. So first things first, we’ll define a function that organizes our expectations by column. This will take in our expectations and use a filter to populate a dictionary where the keys are the names of our columns, and the values are a list of our expectations.

def _sort_expectations_by_column(self, expectations):
    expectations_by_column = {}
    expectations_dictionary = expectations.expectations
    column_names = list(self._get_table_columns(expectations).keys())
    for column in column_names:
        expectations_by_column[column] = list(filter(
            lambda x: x.kwargs.get("column")==column,
            expectations_dictionary))
    return expectations_by_column
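
The function above re-scans the full expectation list once per column. As an alternative, here is a standalone sketch of the same grouping done in a single pass. It is not the plugin’s code: group_expectations_by_column is a hypothetical free function, and SimpleNamespace objects stand in for GE’s expectation objects, assuming only that each one exposes a kwargs dictionary as in the function above.

```python
from types import SimpleNamespace


def group_expectations_by_column(column_names, expectations):
    """Group expectation-like objects by kwargs["column"] in one pass."""
    grouped = {name: [] for name in column_names}
    for expectation in expectations:
        column = expectation.kwargs.get("column")
        # Table-level expectations have no "column" kwarg and are skipped.
        if column in grouped:
            grouped[column].append(expectation)
    return grouped


# Stand-ins for expectation objects; the column names are placeholders.
sample_expectations = [
    SimpleNamespace(expectation_type="expect_table_row_count_to_be_between",
                    kwargs={"min_value": 1}),
    SimpleNamespace(expectation_type="expect_column_values_to_not_be_null",
                    kwargs={"column": "npi"}),
    SimpleNamespace(expectation_type="expect_column_values_to_be_of_type",
                    kwargs={"column": "npi", "type_": "int"}),
]

grouped = group_expectations_by_column(["npi", "final_mips_score"], sample_expectations)
print({name: len(items) for name, items in grouped.items()})
```

Columns without any expectations still get an empty list, so downstream code can loop over every column without checking for missing keys.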

What column should we add next? Data types for each column would be useful, and if you specify that you would like column data typing during suite scaffold, GE can do a great job of parsing our data source in order to guess the data type of each column. It saves this in the expect_column_values_to_be_of_type expectation. Here we’ll create another function for our data types column. Note: If you didn’t specify this expectation in scaffolding, you can create these typing expectations manually for each column. Additional note: Depending on your needs, you may find yourself with an expect_column_values_to_be_in_type_list expectation instead, in which case you will need to modify this function slightly.

def _get_column_data_types(self, expectations_by_column):
    column_data_type_expectations = {}
    for k, v in expectations_by_column.items():
        # Keep only the type expectation(s) for this column
        expectation = [i for i in v if i["expectation_type"] == "expect_column_values_to_be_of_type"]
        if len(expectation) > 0:
            # Use the first match; type_ is None if the kwarg is missing
            type_list = expectation[0].kwargs.get("type_")
            column_data_type_expectations[k] = type_list
        else:
            # No type expectation for this column
            column_data_type_expectations[k] = None
    return column_data_type_expectations

This will take in our sorted expectations by column, loop through each column, and if it has an expect_column_values_to_be_of_type expectation, it will add the type_ from that expectation to a returned dictionary. We have some very basic error handling here in case we don’t have an expectation for expect_column_values_to_be_of_type or in case the type_ in the expectation is empty.
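
For suites that contain expect_column_values_to_be_in_type_list instead, one possible adaptation is sketched below. It is a standalone illustration rather than the plugin’s code: get_column_data_types is a hypothetical free function, SimpleNamespace objects stand in for GE’s expectation objects (read here with attribute access), and joining the type_list entries into a comma-separated string is just one display choice.

```python
from types import SimpleNamespace


def get_column_data_types(expectations_by_column):
    """Return {column: type string or None}, reading either type expectation."""
    data_types = {}
    for column, expectations in expectations_by_column.items():
        data_types[column] = None  # default when no type expectation exists
        for expectation in expectations:
            kwargs = expectation.kwargs
            if expectation.expectation_type == "expect_column_values_to_be_of_type":
                data_types[column] = kwargs.get("type_")
                break
            if expectation.expectation_type == "expect_column_values_to_be_in_type_list":
                # Show every accepted type; an empty or missing list becomes None.
                type_list = kwargs.get("type_list") or []
                data_types[column] = ", ".join(type_list) or None
                break
    return data_types


# Placeholder columns and types for illustration only.
sample_by_column = {
    "npi": [SimpleNamespace(
        expectation_type="expect_column_values_to_be_of_type",
        kwargs={"column": "npi", "type_": "int"})],
    "final_mips_score": [SimpleNamespace(
        expectation_type="expect_column_values_to_be_in_type_list",
        kwargs={"column": "final_mips_score", "type_list": ["float", "float64"]})],
    "org_name": [],
}

data_types = get_column_data_types(sample_by_column)
print(data_types)
```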

Using our _get_column_data_types function, we can add the next column to our data dictionary. In our _render_data_dictionary function, we’ll need to add a new column to our data_dictionary_df, and then a new column name to the "header_row" list in the RenderedTableContent object. I also know that I’ll be adding more columns from our expectations, so I’ll save our sorted expectations by column to a variable.

def _render_data_dictionary(self, expectations):
        data_dictionary_df = pd.DataFrame()
        data_dictionary_df["Column Name"] = [i for i in self._get_table_columns(expectations)]
        data_dictionary_df["Description"] = [i["description"] for i in self._get_table_columns(expectations).values()]

        expectations_by_column = self._sort_expectations_by_column(expectations)
        data_dictionary_df["Data Type"] = [i for i in self._get_column_data_types(expectations_by_column).values()]
        return RenderedTableContent(**{
            "content_block_type": "table",
            "header_row": ["Column Name", "Description", "Data Type"],
            "header": "Data Dictionary",
            "table": data_dictionary_df.values,
            "styling": {
                "classes": ["col-12", "table-responsive"],
                "styles": {
                    "margin-top": "20px",
                    "margin-bottom": "20px"
                },
                "body": {
                    "classes": ["table", "table-sm"]
                }
            },
        })

When we rebuild our documentation, we can see our new data type column!

final_dictionary

See the gist for all of the code we’ve written so far.

There are many more columns we can add based on our expectation suite, like nullity, set values, and minimums and maximums. You can specify these columns when running suite scaffold or create these expectations manually. As you update your expectation suite, you can count on this data dictionary to stay current. If you add new expectations, like those for string length or regexes, you can add new columns to show those new attributes. And if you decide that you no longer want this data dictionary in your GE documentation, you can remove the plugin, or write a new one!
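
As a rough sketch of what those extra columns could look like, the function below summarizes nullability, allowed values, and a min/max range for one column. It is not taken from the final plugin: summarize_column and its output keys are hypothetical, SimpleNamespace objects stand in for GE’s expectation objects, and treating a column with no not-null expectation as nullable is an assumption you may want to change.

```python
from types import SimpleNamespace


def summarize_column(expectations):
    """Build extra data dictionary fields from one column's expectations."""
    summary = {"Nullable": True, "Allowed Values": None, "Min": None, "Max": None}
    for expectation in expectations:
        kwargs = expectation.kwargs
        if expectation.expectation_type == "expect_column_values_to_not_be_null":
            # With mostly < 1, some nulls are still tolerated.
            mostly = kwargs.get("mostly")
            summary["Nullable"] = mostly is not None and mostly < 1
        elif expectation.expectation_type == "expect_column_values_to_be_in_set":
            summary["Allowed Values"] = kwargs.get("value_set")
        elif expectation.expectation_type == "expect_column_values_to_be_between":
            summary["Min"] = kwargs.get("min_value")
            summary["Max"] = kwargs.get("max_value")
    return summary


# Placeholder expectations for a single hypothetical score column.
score_expectations = [
    SimpleNamespace(expectation_type="expect_column_values_to_not_be_null",
                    kwargs={"column": "final_mips_score"}),
    SimpleNamespace(expectation_type="expect_column_values_to_be_between",
                    kwargs={"column": "final_mips_score",
                            "min_value": 0, "max_value": 100}),
]

score_summary = summarize_column(score_expectations)
print(score_summary)
```

Each key in the returned dictionary could become one more column in data_dictionary_df, with a matching entry added to "header_row".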

Final code for the plugin I built is here for reference but note that the blog doesn’t cover everything!

Greetings! Have any questions about using Great Expectations? Join us on Slack
Have something to say about our blog? Shout it from the rooftops!
The Great Expectations Team
