Building a Self-Updating Data Dictionary with Great Expectations (UPDATED)

Extending Great Expectations data documentation makes it possible to build customized sites that are dynamically updated based on new data.

Great Expectations
July 31, 2020

This post has been updated and works as of great_expectations 0.11.8. Originally posted October 04, 2019.

Recently, we've been thinking a lot about data dictionaries at Superconductive Health. Our clients can have a broad range of technical expertise, and some clients ask us for quick summaries of their data in data dictionary form - e.g. a summary of the metadata about each column in a table. With that frequent request in mind, I really appreciated Alex Jia and Michael Kaminsky's recent post, "You probably don't need a data dictionary." They made some salient points about the maintenance burden becoming a scalability issue for data dictionaries. Their logical conclusion: if a data dictionary needs to be manually maintained, it's probably not worth the effort, because it means maintaining multiple sources of truth for your data instead of relying on one. But what if a data dictionary were updated automatically based on your single source of truth?

This seemed like a great candidate for a Great Expectations plugin. We're already using GE as our lone source of truth for data documentation, and with the amount of information already captured in GE, we wouldn't need to capture any new information - rather, we just need to provide a new view into the information that we already have. At the same time, this data dictionary feature might not provide critical functionality for every GE user, so rather than building it into the core project, building it out as a plugin seemed like the ideal way to proceed. Plugins are a simple way to add modular functionality to GE without altering the core. Running great_expectations init to initialize a GE project automatically creates a plugins folder where you can drop bits of code to add or adjust functionality in GE. We can then adjust the project settings in our project configuration file (also created upon project initialization) to tell GE to use our plugin instead of the base code.


Since our goal for building a data dictionary is to provide a new view into data that we already have, this falls into GE's rendering functionality - we don't need to process new information, but just to repackage the information we already have in a new way. For this example project, we're using the Overall Performance dataset, which contains Merit-Based Incentive Payment System (MIPS) final scores and performance category scores for clinicians. You can export it as a CSV from the top of this page. As a reminder, this is what our basic expectation renderer looks like:


Our goal is to add a "Data Dictionary" section above the "Expectation Suite Overview" that will provide a table with some metadata about the columns of this CSV, like data type, nullability, possible values, etc. These columns will automatically populate based on our expectations, so as our data changes, the data dictionary will change without requiring manual maintenance. Let me say this one more time: your data dictionaries will update automatically when your data documentation gets updated automatically - no manual intervention necessary. That solves the primary problem that Michael and Alex had identified with data dictionaries.

Before we dive into the weeds, let's start by building a very simple plugin that adds a "Hello World" section to our already rendered documentation so that we can see what the plugin process is like. First, in our project configuration file, under the settings for the local site, we will create a new key called site_section_builders. Under this, we'll add a key called expectations, below that, a key called renderer, and below that we will add our module_name and class_name:

    class_name: SiteBuilder
    # set to false to hide how-to buttons in Data Docs
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_section_builders:
      expectations:
        renderer:
          module_name: custom_data_docs.my_custom_renderer
          class_name: MyCustomExpectationSuitePageRenderer
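
To make the module_name and class_name pair less mysterious, here is a minimal sketch of how such a pair can be turned into a class with Python's importlib. This is not GE's own loading code, only an illustration of the general idea; the load_class helper is hypothetical, and a standard-library class stands in for the renderer so that the snippet runs anywhere.

```python
import importlib


def load_class(module_name, class_name):
    """Import module_name and return the attribute named class_name."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Hypothetical stand-in for the renderer entry in the config above;
# a standard-library class is used so the sketch runs without GE installed.
renderer_config = {"module_name": "collections", "class_name": "OrderedDict"}
renderer_class = load_class(**renderer_config)
print(renderer_class.__name__)
```

With the real configuration, the same lookup would only succeed if the module named in module_name is importable, which is why the plugin file has to live where the project can find it.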

Next, we'll need to actually add the plugin. In the plugins directory, we'll create a module file named to match the module_name from our project configuration (custom_data_docs.my_custom_renderer). At the top, we'll import the necessary pieces from core GE.

    from great_expectations.render.renderer import (
        ExpectationSuitePageRenderer
    )
    from great_expectations.render.types import (
        RenderedDocumentContent,
        RenderedSectionContent,
        RenderedHeaderContent,
        RenderedTableContent,
    )

Then we'll create a class that inherits from the pre-existing ExpectationSuitePageRenderer, and override the render function - we'll start by pasting in the original function. Then we'll add a call to _render_data_dictionary_header at the start of overview_content_blocks. Finally we'll create the _render_data_dictionary_header function below. The goal here is to prepend our new "Hello World" section to our existing rendering functionality.

    import pandas as pd

    from great_expectations.render.renderer import (
        ExpectationSuitePageRenderer
    )
    from great_expectations.render.types import (
        RenderedDocumentContent,
        RenderedSectionContent,
        RenderedHeaderContent,
        RenderedTableContent,
    )


    class MyCustomExpectationSuitePageRenderer(ExpectationSuitePageRenderer):
        def render(self, expectations):
            columns, ordered_columns = self._group_and_order_expectations_by_column(
                expectations
            )
            expectation_suite_name = expectations.expectation_suite_name

            overview_content_blocks = [
                # New: our custom block goes first so it renders at the top.
                self._render_data_dictionary_header(expectations),
                self._render_expectation_suite_header(),
                self._render_expectation_suite_info(expectations),
            ]

            table_level_expectations_content_block = self._render_table_level_expectations(
                columns
            )
            if table_level_expectations_content_block is not None:
                overview_content_blocks.append(table_level_expectations_content_block)

            asset_notes_content_block = self._render_expectation_suite_notes(expectations)
            if asset_notes_content_block is not None:
                overview_content_blocks.append(asset_notes_content_block)

            sections = [
                RenderedSectionContent(
                    **{
                        "section_name": "Overview",
                        "content_blocks": overview_content_blocks,
                    }
                )
            ]

            sections += [
                self._column_section_renderer.render(expectations=columns[column])
                for column in ordered_columns
                if column != "_nocolumn"
            ]
            return RenderedDocumentContent(
                **{
                    "renderer_type": "ExpectationSuitePageRenderer",
                    "page_title": expectation_suite_name,
                    "expectation_suite_name": expectation_suite_name,
                    "utm_medium": "expectation-suite-page",
                    "sections": sections,
                }
            )

        def _render_data_dictionary_header(self, expectations):
            # A plain header block styled as a full-width alert.
            return RenderedHeaderContent(**{
                "content_block_type": "header",
                "header": "hello world",
                "styling": {
                    "classes": ["col-12"],
                    "header": {
                        "classes": ["alert", "alert-secondary"]
                    }
                }
            })
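
Stripped of the GE specifics, the pattern here is "subclass, override one method, prepend a block." The sketch below shows that pattern with two hypothetical stand-in classes (BasePageRenderer and HelloWorldPageRenderer are not GE classes) and plain dictionaries in place of GE's rendered content types.

```python
class BasePageRenderer:
    """Hypothetical stand-in for a page renderer provided by a library."""

    def _render_header(self):
        return {"content_block_type": "header", "header": "Expectation Suite Overview"}

    def render(self, suite):
        # The base class returns a flat list of content blocks.
        return [self._render_header()]


class HelloWorldPageRenderer(BasePageRenderer):
    """Subclass that prepends one extra block to the inherited output."""

    def render(self, suite):
        return [self._render_hello_world()] + super().render(suite)

    def _render_hello_world(self):
        return {"content_block_type": "header", "header": "hello world"}


blocks = HelloWorldPageRenderer().render(suite=None)
print([block["header"] for block in blocks])
```

The stand-in can call super().render() because its output is a flat list. In the plugin above, the content blocks are nested inside the returned document object, which is one reason to copy the method body and edit the list directly.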

Now, when we go to the command line and run great_expectations docs build, we will rebuild our documentation with the new renderer. When we edited our project configuration file, we changed the renderer for the local site, so we'll want to open that one up. When opening data docs, we will start on the index page, so we will click into an expectation suite (as we are editing an expectation suite renderer). And when we do, lo and behold! A shiny, new "hello world" section prepends our existing renderer.


Being able to change the renderer's output is pretty exciting. But in its current format, our plugin isn't very useful. Let's change that by configuring it to display a table with our columns, and any column descriptions. In order to do this, we'll go into our expectation suite file. When we run great_expectations init, GE profiles our datasource in order to provide us with a template for specifying our own expectations. These profiles are saved as expectation suite files. Initially, these will be pretty generic, but as we fill in these profiles with our expectations, they will act as the source of truth from which our data dictionary will pull. Additionally, GE does a pretty good job of extracting data types, so we get that as a freebie. Let's open one up and take a look.


We can see that our expectation suite file is divided into two main sections. There's a meta section which contains some metadata for our datasource like column names and column descriptions (your descriptions might be empty), as well as a broader section. Then there is our expectations section, which is the bulk of this file. It contains all of the expectations themselves. We'll dive deeper into this part later on in this post.
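
As a rough illustration of that layout, here is a heavily trimmed, hypothetical suite loaded with the standard json module. The column names and the one non-empty description are made up for the example, and a real suite file carries more keys than shown; only the meta and expectations sections described above are sketched.

```python
import json

# Hypothetical, heavily trimmed suite; column names and text are made up.
suite_json = """
{
  "expectation_suite_name": "example_suite",
  "meta": {
    "columns": {
      "npi": {"description": ""},
      "final_score": {"description": "Final MIPS score"}
    }
  },
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_be_of_type",
      "kwargs": {"column": "final_score", "type_": "float64"}
    }
  ]
}
"""

suite = json.loads(suite_json)
print(sorted(suite))                   # top-level keys
print(list(suite["meta"]["columns"]))  # column names, in file order
print(len(suite["expectations"]))      # number of expectations
```

Note that the expectations list is flat: each entry names its column inside kwargs rather than being nested under a column.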

For now, let's try extracting the column names and column descriptions - these will make up the first two columns of our data dictionary. If we look back at the code for the "hello world" renderer that we built, we have most of the building blocks of our renderer in our overview_content_blocks list. So far, all we have is a header - let's try adding the empty table that will be our data dictionary. To do that, we'll add a new _render_data_dictionary function. This will return a new RenderedTableContent object that will get added to the overview_content_blocks list in our render function:

    # Both functions are methods of MyCustomExpectationSuitePageRenderer.
    def render(self, expectations):
        columns, ordered_columns = self._group_and_order_expectations_by_column(
            expectations
        )
        expectation_suite_name = expectations.expectation_suite_name

        overview_content_blocks = [
            self._render_data_dictionary_header(expectations),
            # New: the data dictionary table sits right below its header.
            self._render_data_dictionary(expectations),
            self._render_expectation_suite_header(),
            self._render_expectation_suite_info(expectations),
        ]

        table_level_expectations_content_block = self._render_table_level_expectations(
            columns
        )
        if table_level_expectations_content_block is not None:
            overview_content_blocks.append(table_level_expectations_content_block)

        asset_notes_content_block = self._render_expectation_suite_notes(expectations)
        if asset_notes_content_block is not None:
            overview_content_blocks.append(asset_notes_content_block)

        sections = [
            RenderedSectionContent(
                **{
                    "section_name": "Overview",
                    "content_blocks": overview_content_blocks,
                }
            )
        ]

        sections += [
            self._column_section_renderer.render(expectations=columns[column])
            for column in ordered_columns
            if column != "_nocolumn"
        ]
        return RenderedDocumentContent(
            **{
                "renderer_type": "ExpectationSuitePageRenderer",
                "page_title": expectation_suite_name,
                "expectation_suite_name": expectation_suite_name,
                "utm_medium": "expectation-suite-page",
                "sections": sections,
            }
        )

    def _render_data_dictionary(self, expectations):
        # Empty table for now: only the title and the header row will show.
        return RenderedTableContent(**{
            "content_block_type": "table",
            "header_row": ["Column Name", "Description"],
            "header": "Data Dictionary",
            "table": [],
            "styling": {
                "classes": ["col-12", "table-responsive"],
                "styles": {
                    "margin-top": "20px",
                    "margin-bottom": "20px"
                },
                "body": {
                    "classes": ["table", "table-sm"]
                }
            },
        })

For now, we'll leave the table an empty list, and we'll specify our first two columns in the header_row ("Column Name" and "Description"). Now when we rebuild our documentation (by running great_expectations docs build on the command line), you'll see the start of our new table with its header and header row.


Now what we'll do is replace the empty list that is our table with a pandas DataFrame. Then we can add each column to the DataFrame as we parse it from the expectation suite, and slowly populate our data dictionary. Let's first build a function to get the table columns and descriptions. Because of the meta section in our expectation suite file, this is a simple, one-line function.

    def _get_table_columns(self, expectations):
        # Returns the "columns" mapping from the suite's meta section, or None.
        return expectations.meta.get("columns")

Next, we'll create the pandas DataFrame, and add our columns for Column Name and Description. Make sure that you're importing pandas at the top of the module. Then we'll use a list comprehension in conjunction with our _get_table_columns function to populate each of these columns. Also, let's change out the text in our header from "hello world" to "Data Dictionary." Lastly, be sure to replace your empty list under the "table" key with the values of your new DataFrame!

    def _render_data_dictionary(self, expectations):
        data_dictionary_df = pd.DataFrame()
        # Only populate the table when the suite's meta section lists columns.
        if self._get_table_columns(expectations):
            data_dictionary_df["Column Name"] = [i for i in self._get_table_columns(expectations)]
            data_dictionary_df["Description"] = [i["description"] for i in self._get_table_columns(expectations).values()]

        return RenderedTableContent(**{
            "content_block_type": "table",
            "header_row": ["Column Name", "Description"],
            "header": "Data Dictionary",
            "table": data_dictionary_df.values,
            "styling": {
                "classes": ["col-12", "table-responsive"],
                "styles": {
                    "margin-top": "20px",
                    "margin-bottom": "20px"
                },
                "body": {
                    "classes": ["table", "table-sm"]
                }
            },
        })
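
One thing to watch in the comprehension above: indexing i["description"] directly raises a KeyError for any column whose metadata has no description key. The sketch below shows a more forgiving way to build the same two columns. It is an illustration rather than part of the plugin: build_dictionary_rows is a hypothetical helper, the metadata is made up, and plain lists stand in for the DataFrame so the snippet runs without pandas.

```python
def build_dictionary_rows(columns_meta):
    """Return [name, description] rows from a meta "columns" mapping."""
    if not columns_meta:
        return []
    return [
        # Fall back to an empty string when the description is missing.
        [name, (details or {}).get("description", "")]
        for name, details in columns_meta.items()
    ]


# Hypothetical metadata; "npi" has no description key at all.
columns_meta = {
    "npi": {},
    "final_score": {"description": "Final MIPS score"},
}
rows = build_dictionary_rows(columns_meta)
print(rows)
```

The same .get("description", "") fallback could be dropped into the DataFrame version if your suite's metadata is incomplete.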

When we rebuild our documentation, we can see our first two columns! At this point, you may have noticed that your "Description" column is empty. That's because the descriptions for each column in the "meta" section of our expectation suite are empty. And that is the idea here - as long as we keep our expectations up-to-date in the form of these expectation suite files, we can count on our data dictionary to be up-to-date. So if we add some descriptions in the expectation suite file, we'll start to fill in our data dictionary.


Now we can start parsing our column-level expectations in order to populate some more columns in our data dictionary. We're going to be reaching into the expectations part of our expectation suite file in order to do this. If you've perused the file, you might have noticed that all of our expectations are at the same level - they are not organized under individual column sections. So first things first, we'll define a function that organizes our expectations by column. This will take in our expectations and use a filter to populate a dictionary where the keys are the names of our columns, and the values are lists of the expectations for each column.

    def _sort_expectations_by_column(self, expectations):
        expectations_by_column = {}
        expectations_dictionary = expectations.expectations
        column_names = list(self._get_table_columns(expectations).keys())
        for column in column_names:
            # Keep only the expectations whose "column" kwarg matches.
            expectations_by_column[column] = list(filter(
                lambda x: x.kwargs.get("column") == column,
                expectations_dictionary))
        return expectations_by_column
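
The filter-based version walks the full expectation list once per column. For larger suites, the same grouping can be done in a single pass. The sketch below is one way to do that, under the assumption that each expectation exposes a kwargs dictionary as in the function above; group_expectations_by_column is a hypothetical helper, and SimpleNamespace objects stand in for GE's expectation objects.

```python
from types import SimpleNamespace


def group_expectations_by_column(column_names, expectations):
    """Map each known column name to the expectations that target it."""
    grouped = {name: [] for name in column_names}
    for expectation in expectations:
        column = expectation.kwargs.get("column")
        # Table-level expectations have no "column" kwarg and are skipped.
        if column in grouped:
            grouped[column].append(expectation)
    return grouped


# Hypothetical stand-ins: only the attributes used above are imitated.
expectations = [
    SimpleNamespace(expectation_type="expect_column_values_to_not_be_null",
                    kwargs={"column": "npi"}),
    SimpleNamespace(expectation_type="expect_table_row_count_to_be_between",
                    kwargs={"min_value": 1}),
    SimpleNamespace(expectation_type="expect_column_values_to_be_of_type",
                    kwargs={"column": "npi", "type_": "int64"}),
]
grouped = group_expectations_by_column(["npi", "final_score"], expectations)
print({name: len(items) for name, items in grouped.items()})
```

Columns with no expectations still appear in the result with an empty list, which keeps the data dictionary rows aligned with the column list.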

What column should we add next? Data types for each column would be useful, and if you specify that you would like column data typing during suite scaffold, GE can do a great job of parsing our data source in order to guess the data type of each column. It saves this in the expect_column_values_to_be_of_type expectation. Here we'll create another function for our data types column. Note: If you didn't specify this expectation in scaffolding, you can create these typing expectations manually for each column. Additional note: Depending on your needs, you may find yourself with a different type expectation, and you may need to modify this function slightly.

    def _get_column_data_types(self, expectations_by_column):
        column_data_type_expectations = {}
        for k, v in expectations_by_column.items():
            expectation = [i for i in v if i["expectation_type"] == "expect_column_values_to_be_of_type"]
            if len(expectation) > 0:
                # Use the first type expectation found for this column.
                type_list = expectation[0].kwargs.get("type_")
                column_data_type_expectations[k] = type_list
            else:
                column_data_type_expectations[k] = None
        return column_data_type_expectations

This will take in our sorted expectations by column, loop through each column, and if it has an expect_column_values_to_be_of_type expectation, it will add the type_ from that expectation to a returned dictionary. We have some very basic error handling here in case we don't have an expect_column_values_to_be_of_type expectation for a column, or in case the type_ in the expectation is empty.
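
To see the behavior on concrete input, here is a small self-contained sketch of the same lookup. It assumes expectations are plain dictionaries with expectation_type and kwargs keys, which is a simplification of GE's expectation objects; column_data_types and the sample data are hypothetical.

```python
TYPE_EXPECTATION = "expect_column_values_to_be_of_type"


def column_data_types(expectations_by_column):
    """Return {column: type_ or None} from grouped expectation dicts."""
    data_types = {}
    for column, expectations in expectations_by_column.items():
        matches = [e for e in expectations
                   if e["expectation_type"] == TYPE_EXPECTATION]
        # Use the first matching expectation; None marks "no type recorded".
        data_types[column] = matches[0]["kwargs"].get("type_") if matches else None
    return data_types


# Hypothetical grouped expectations for two columns.
expectations_by_column = {
    "npi": [
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "npi"}},
        {"expectation_type": TYPE_EXPECTATION,
         "kwargs": {"column": "npi", "type_": "int64"}},
    ],
    "final_score": [],
}
print(column_data_types(expectations_by_column))
```

A column without a type expectation maps to None, so it still gets a row in the data dictionary with an empty data type cell.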

Using our _get_column_data_types function, we can add the next column to our data dictionary. In our _render_data_dictionary function, we'll need to add a new column to our data_dictionary_df, and then a new column name to the header_row list in the RenderedTableContent object. I also know that I'll be adding more columns from our expectations, so I'll save our sorted expectations by column to a variable.

    def _render_data_dictionary(self, expectations):
        data_dictionary_df = pd.DataFrame()
        data_dictionary_df["Column Name"] = [i for i in self._get_table_columns(expectations)]
        data_dictionary_df["Description"] = [i["description"] for i in self._get_table_columns(expectations).values()]

        # Group once and reuse for every expectation-derived column.
        expectations_by_column = self._sort_expectations_by_column(expectations)
        data_dictionary_df["Data Type"] = [i for i in self._get_column_data_types(expectations_by_column).values()]

        return RenderedTableContent(**{
            "content_block_type": "table",
            "header_row": ["Column Name", "Description", "Data Type"],
            "header": "Data Dictionary",
            "table": data_dictionary_df.values,
            "styling": {
                "classes": ["col-12", "table-responsive"],
                "styles": {
                    "margin-top": "20px",
                    "margin-bottom": "20px"
                },
                "body": {
                    "classes": ["table", "table-sm"]
                }
            },
        })

When we rebuild our documentation, we can see our new data type column!


See the gist for all of the code we've written so far.

There are many more columns we can add based on our expectation suite, like nullity, set values, and mins and maxes. You can specify these columns when running suite scaffold or create these expectations manually. As you update your expectation suite, you can count on this data dictionary to stay current. If you add new expectations, like those for string length or regexes, you can add new columns to show those new attributes. And if you decide that you no longer want this data dictionary in your GE documentation, you can remove the plugin, or write a new one!
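
As a rough sketch of how those extra columns might be derived, the helper below reads nullability, allowed values, and a numeric range from one column's expectations. It is an illustration, not the plugin's final code: summarize_column is hypothetical, expectations are simplified to plain dictionaries, and treating a column as nullable unless a not-null expectation exists is an assumption you may want to refine (for example, when the expectation only applies to most rows).

```python
def summarize_column(expectations):
    """Collect nullability, allowed values, and range from expectation dicts."""
    summary = {"Nullable": True, "Allowed Values": None, "Min": None, "Max": None}
    for expectation in expectations:
        kind = expectation["expectation_type"]
        kwargs = expectation["kwargs"]
        if kind == "expect_column_values_to_not_be_null":
            summary["Nullable"] = False
        elif kind == "expect_column_values_to_be_in_set":
            summary["Allowed Values"] = kwargs.get("value_set")
        elif kind == "expect_column_values_to_be_between":
            summary["Min"] = kwargs.get("min_value")
            summary["Max"] = kwargs.get("max_value")
    return summary


# Hypothetical expectations for a single score column.
score_expectations = [
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "final_score"}},
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "final_score", "min_value": 0, "max_value": 100}},
]
print(summarize_column(score_expectations))
```

Each key in the returned summary could become one more column in the data dictionary, added to the DataFrame and to header_row in the same way as the Data Type column.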

The final code for the plugin I built is here for reference, but note that this post doesn't cover everything in it!
