
Building a Self-Updating Data Dictionary with Great Expectations (UPDATED)

Extending Great Expectations data documentation makes it possible to build customized sites that are dynamically updated based on new data.

Great Expectations
July 31, 2020

This post has been updated and works as of great_expectations 0.11.8. Originally posted October 04, 2019.

Recently, we've been thinking a lot about data dictionaries at Superconductive Health. Our clients have a broad range of technical expertise, and some ask us for quick summaries of their data in data dictionary form, e.g. a summary of the metadata about each column in a table. With that frequent request in mind, I really appreciated Alex Jia and Michael Kaminsky's recent post, "You probably don't need a data dictionary." They made some salient points about how the maintenance burden of a data dictionary becomes a scalability issue. Their logical conclusion: if a data dictionary needs to be manually maintained, it's probably not worth the effort, because it means maintaining multiple sources of truth for your data instead of relying on one. But what if a data dictionary were updated automatically based on your single data source of truth?

This seemed like a great candidate for a Great Expectations plugin. We're already using GE as our lone source of truth for data documentation, and with the amount of information already captured in GE, we wouldn't need to capture any new information - rather, we just need to provide a new view into the information that we already have. At the same time, this data dictionary feature might not provide critical functionality for every GE user, so rather than building it into the core project, building it out as a plugin seemed like the ideal way to proceed. Plugins are a simple way to add modular functionality to GE without altering the core. Running great_expectations init to initialize a GE project automatically creates a plugins folder where you can drop bits of code to add or adjust functionality in GE. We can then adjust the project settings in our great_expectations.yml file (also created upon project initialization) to tell GE to use our plugin instead of the base code.

[Image: plugin_dirtree]

Since our goal for building a data dictionary is to provide a new view into data that we already have, this falls into GE's rendering functionality - we don't need to process new information, but just to repackage the information we already have in a new way. For this example project, we're using the Overall Performance dataset, which contains Merit-Based Incentive Payment System (MIPS) final scores and performance category scores for clinicians. You can export it as a csv from the top of this page. As a reminder, this is what our basic expectation renderer looks like:

[Image: basic_renderer]

Our goal is to add a "Data Dictionary" section above the "Expectation Suite Overview" that will provide a table with some metadata about the columns of this CSV, like data type, nullability, possible values, etc. These columns will automatically populate based on our expectations, so as our data changes, the data dictionary will change without requiring manual maintenance. Let me say this one more time: Your data dictionaries will update automatically when your data documentation gets updated automatically - no manual intervention necessary. That solves the primary problem that Michael and Alex had identified with data dictionaries.

Before we dive into the weeds, let's start by building a very simple plugin that adds a "Hello World" section to our already rendered documentation so that we can see what the plugin process is like. First, in our great_expectations.yml file, under data_docs_sites then local_site, we will create a new key called site_section_builders. Under this, we'll add a key called expectations, below that, a key called renderer, and below that we will add our module_name and class_name.

local_site:
  class_name: SiteBuilder
  # set to false to hide how-to buttons in Data Docs
  show_how_to_buttons: true
  store_backend:
    class_name: TupleFilesystemStoreBackend
    base_directory: uncommitted/data_docs/local_site/

  # new: point the expectation suite pages at our plugin renderer
  site_section_builders:
    expectations:
      renderer:
        module_name: custom_data_docs.my_custom_renderer
        class_name: MyCustomExpectationSuitePageRenderer
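To make the module_name and class_name keys a little less abstract, here is a rough sketch of how a dotted module path and a class name can be resolved from a plugins folder using only the standard library. This is not GE's actual loading code: load_renderer_class and the stand-in plugin source are hypothetical, and the throwaway directory only mimics the plugins/custom_data_docs/ layout.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the plugin file. The real plugin subclasses
# ExpectationSuitePageRenderer; this one only proves the lookup works.
PLUGIN_SOURCE = '''
class MyCustomExpectationSuitePageRenderer:
    def render(self, expectations):
        return "rendered by plugin"
'''


def load_renderer_class(plugins_dir, module_name, class_name):
    """Import module_name from plugins_dir and return the class called class_name."""
    if plugins_dir not in sys.path:
        sys.path.insert(0, plugins_dir)
    importlib.invalidate_caches()  # the files below were created moments ago
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Build a throwaway plugins/custom_data_docs/my_custom_renderer.py to demo with.
plugins_dir = tempfile.mkdtemp()
package_dir = Path(plugins_dir) / "custom_data_docs"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")
(package_dir / "my_custom_renderer.py").write_text(PLUGIN_SOURCE)

renderer_class = load_renderer_class(
    plugins_dir,
    "custom_data_docs.my_custom_renderer",
    "MyCustomExpectationSuitePageRenderer",
)
print(renderer_class().render(expectations=None))
```

The practical takeaway is that module_name has to match the folder and file name under plugins, and class_name has to match the class defined in that file. A typo in either one will surface as an import error when the docs are built.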

Next, we'll need to actually add the plugin. In the plugins/custom_data_docs/ directory, we'll create a file called my_custom_renderer.py (to match the module_name from our great_expectations.yml). At the top, we'll import the necessary modules from core GE.

from great_expectations.render.renderer import (
    ExpectationSuitePageRenderer
)
from great_expectations.render.types import (
    RenderedDocumentContent,
    RenderedSectionContent,
    RenderedHeaderContent,
    RenderedTableContent,
)

Then we'll create a class that inherits from the pre-existing ExpectationSuitePageRenderer, and override the render function - we'll start by pasting in the original function. Then we'll add a call to _render_data_dictionary_header at the start of overview_content_blocks. Finally we'll create the _render_data_dictionary_header function below. The goal here is to prepend our new "Hello World" section to our existing rendering functionality.

import pandas as pd

from great_expectations.render.renderer import (
    ExpectationSuitePageRenderer
)
from great_expectations.render.types import (
    RenderedDocumentContent,
    RenderedSectionContent,
    RenderedHeaderContent,
    RenderedTableContent,
)

class MyCustomExpectationSuitePageRenderer(ExpectationSuitePageRenderer):
    def render(self, expectations):
        columns, ordered_columns = self._group_and_order_expectations_by_column(
            expectations
        )
        expectation_suite_name = expectations.expectation_suite_name

        overview_content_blocks = [
            # New: our section goes first so it renders above the overview.
            self._render_data_dictionary_header(expectations),
            self._render_expectation_suite_header(),
            self._render_expectation_suite_info(expectations),
        ]

        table_level_expectations_content_block = self._render_table_level_expectations(
            columns
        )
        if table_level_expectations_content_block is not None:
            overview_content_blocks.append(table_level_expectations_content_block)

        asset_notes_content_block = self._render_expectation_suite_notes(expectations)
        if asset_notes_content_block is not None:
            overview_content_blocks.append(asset_notes_content_block)

        sections = [
            RenderedSectionContent(
                **{
                    "section_name": "Overview",
                    "content_blocks": overview_content_blocks,
                }
            )
        ]

        sections += [
            self._column_section_renderer.render(expectations=columns[column])
            for column in ordered_columns
            if column != "_nocolumn"
        ]
        return RenderedDocumentContent(
            **{
                "renderer_type": "ExpectationSuitePageRenderer",
                "page_title": expectation_suite_name,
                "expectation_suite_name": expectation_suite_name,
                "utm_medium": "expectation-suite-page",
                "sections": sections,
            }
        )

    def _render_data_dictionary_header(self, expectations):
        # A plain header block; the styling classes are Bootstrap class names.
        return RenderedHeaderContent(**{
            "content_block_type": "header",
            "header": "hello world",
            "styling": {
                "classes": ["col-12"],
                "header": {
                    "classes": ["alert", "alert-secondary"]
                }
            }
        })

Now, when we go to the command line and run great_expectations docs build, we will rebuild our documentation with the new renderer. When we edited our great_expectations.yml file, we changed the renderer for local_site, so we'll want to open that one up. When opening data docs, we will start on the index page, so we will click into an expectation suite (as we are editing an expectation suite renderer). And when we do, lo and behold! A shiny, new "hello world" section sits above our existing rendered content.

[Image: hello_world]

Being able to change the renderer's output is pretty exciting. But in its current format, our plugin isn't very useful. Let's change that by configuring it to display a table with our columns, and any column descriptions. In order to do this, we'll go into our BasicDatasetProfiler.json file. When we run great_expectations init, GE profiles our datasource in order to provide us with a template for specifying our own expectations. These profiles are saved as .json files. Initially, these will be pretty generic, but as we fill in these profiles with our expectations, they will act as the source of truth from which our data dictionary will pull. Additionally, GE does a pretty good job of extracting data types, so we get that as a freebie. Let's open up great_expectations/expectations/mips__dir/default/physician_mips/BasicDatasetProfiler.json and take a look.

[Image: renderer_json]

We can see that our BasicDatasetProfiler.json is divided into two main sections. There's a "meta" section which contains some metadata for our datasource like column names and column descriptions (your descriptions might be empty), as well as a broader "notes" section. Then there is our "expectations" section which is the bulk of this .json file. It contains all of the expectations themselves. We'll dive deeper into this part later on in this post.
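For readers following along without the file open, here is a pared-down sketch of that two-part layout, parsed with the standard json module. The suite name, the column names (npi, final_score) and the note text are made up for illustration, and a real profiler file will contain more keys and many more expectations.

```python
import json

# Hypothetical, heavily trimmed stand-in for BasicDatasetProfiler.json.
SUITE_JSON = """
{
  "expectation_suite_name": "example_suite",
  "meta": {
    "columns": {
      "npi": {"description": ""},
      "final_score": {"description": ""}
    },
    "notes": {"format": "markdown", "content": ["Example note."]}
  },
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "npi"}
    }
  ]
}
"""

suite = json.loads(SUITE_JSON)

# "meta" carries the column names and (possibly empty) descriptions.
column_names = list(suite["meta"]["columns"])

# "expectations" is one flat list; each entry names its column in "kwargs".
expectation_types = [e["expectation_type"] for e in suite["expectations"]]

print(column_names)
print(expectation_types)
```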

For now, let's try extracting the column names and column descriptions - these will make up the first two columns of our data dictionary. If we look back at the code for the "hello world" renderer that we built, we have most of the building blocks of our renderer in our overview_content_blocks list. So far, all we have is a header - let's try adding the empty table that will be our data dictionary. To do that, we'll add a new _render_data_dictionary function. This will return a new RenderedTableContent object that will get added to the overview_content_blocks list in our render function.

# Both functions below are methods of MyCustomExpectationSuitePageRenderer.
def render(self, expectations):
    columns, ordered_columns = self._group_and_order_expectations_by_column(
        expectations
    )
    expectation_suite_name = expectations.expectation_suite_name

    overview_content_blocks = [
        self._render_data_dictionary_header(expectations),
        # New: the table goes directly under our header.
        self._render_data_dictionary(expectations),
        self._render_expectation_suite_header(),
        self._render_expectation_suite_info(expectations),
    ]

    table_level_expectations_content_block = self._render_table_level_expectations(
        columns
    )
    if table_level_expectations_content_block is not None:
        overview_content_blocks.append(table_level_expectations_content_block)

    asset_notes_content_block = self._render_expectation_suite_notes(expectations)
    if asset_notes_content_block is not None:
        overview_content_blocks.append(asset_notes_content_block)

    sections = [
        RenderedSectionContent(
            **{
                "section_name": "Overview",
                "content_blocks": overview_content_blocks,
            }
        )
    ]

    sections += [
        self._column_section_renderer.render(expectations=columns[column])
        for column in ordered_columns
        if column != "_nocolumn"
    ]
    return RenderedDocumentContent(
        **{
            "renderer_type": "ExpectationSuitePageRenderer",
            "page_title": expectation_suite_name,
            "expectation_suite_name": expectation_suite_name,
            "utm_medium": "expectation-suite-page",
            "sections": sections,
        }
    )

# New: an empty table with only a title and a header row for now.
def _render_data_dictionary(self, expectations):
    return RenderedTableContent(**{
        "content_block_type": "table",
        "header_row": ["Column Name", "Description"],
        "header": "Data Dictionary",
        "table": [],
        "styling": {
            "classes": ["col-12", "table-responsive"],
            "styles": {
                "margin-top": "20px",
                "margin-bottom": "20px"
            },
            "body": {
                "classes": ["table", "table-sm"]
            }
        },
    })

For now, we'll leave "table" as an empty list, and we'll specify our first two columns in the "header_row" ("Column Name" and "Description"). Now when we rebuild our documentation (by running great_expectations docs build on the command line), we'll see the start of our new table, with its header and header row.

[Image: headers_only]

Now what we'll do is replace the empty list that is our table with a pandas DataFrame. Then we can add each column to the DataFrame as we parse it from the .json, and slowly populate our data dictionary. Let's first build a function to get the table columns and descriptions. Because of the meta section in our .json file, this is a simple, one-line function.

def _get_table_columns(self, expectations):
    # Returns a dict keyed by column name, or None if the suite has no "columns" in its meta.
    return expectations.meta.get("columns")

Next, we'll create the pandas DataFrame, and add our columns for Column Name and Description. Make sure that you're importing pandas at the top of the module. Then we'll use a list comprehension in conjunction with our _get_table_columns function to populate each of these columns. Also, let's change out the text in our header from "hello world" to "Data Dictionary." Lastly, be sure to replace your empty list in table with the values of your new DataFrame!

def _render_data_dictionary(self, expectations):
    data_dictionary_df = pd.DataFrame()
    # Only fill the table when the suite's meta actually lists columns.
    if self._get_table_columns(expectations):
        data_dictionary_df["Column Name"] = [i for i in self._get_table_columns(expectations)]
        data_dictionary_df["Description"] = [i["description"] for i in self._get_table_columns(expectations).values()]

    return RenderedTableContent(**{
        "content_block_type": "table",
        "header_row": ["Column Name", "Description"],
        "header": "Data Dictionary",
        "table": data_dictionary_df.values,  # one row per column of the dataset
        "styling": {
            "classes": ["col-12", "table-responsive"],
            "styles": {
                "margin-top": "20px",
                "margin-bottom": "20px"
            },
            "body": {
                "classes": ["table", "table-sm"]
            }
        },
    })

When we rebuild our documentation, we can see our first two columns! At this point, you may have noticed that your "Description" column is empty. That's because the descriptions for each column in the "meta" section of our BasicDatasetProfiler.json are empty. And that is the idea here - as long as we keep our expectations up-to-date in the form of these .json files, we can count on our data dictionary to be up-to-date. So if we add some descriptions in the .json file, we'll start to fill in our data dictionary.
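To see that "edit the .json, get a new dictionary" loop in miniature, the sketch below builds the same two-column rows from a plain dict, without pandas or GE. The build_rows helper, the column names and the description text are all hypothetical.

```python
def build_rows(meta):
    """Return [column name, description] rows from a suite's "meta" dict."""
    columns = meta.get("columns") or {}
    return [[name, info.get("description", "")] for name, info in columns.items()]


# Stand-in for the "meta" section of the suite .json (column names are made up).
meta = {
    "columns": {
        "npi": {"description": ""},
        "final_score": {"description": ""},
    }
}

before = build_rows(meta)

# Simulate filling in one description by hand in the .json file.
meta["columns"]["final_score"]["description"] = "MIPS final score for the clinician"

after = build_rows(meta)

print(before)
print(after)
```

Nothing in build_rows changes between the two calls. Only the source of truth changes, and the rendered rows follow.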

[Image: basic_dictionary]

Now we can start parsing our column level expectations in order to populate some more columns in our data dictionary. We're going to be reaching into the expectations part of our .json file in order to do this. If you've perused the .json, you might have noticed that all of our expectations are at the same level - they are not organized under individual column sections. So first things first, we'll define a function that organizes our expectations by column. This will take in our expectations and use a filter to populate a dictionary where the keys are the names of our columns, and the values are a list of our expectations.

def _sort_expectations_by_column(self, expectations):
    expectations_by_column = {}
    expectations_dictionary = expectations.expectations
    column_names = list(self._get_table_columns(expectations).keys())
    for column in column_names:
        # Keep only the expectations whose "column" kwarg matches this column.
        expectations_by_column[column] = list(filter(
            lambda x: x.kwargs.get("column")==column,
            expectations_dictionary))
    return expectations_by_column
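The function above scans the full expectation list once per column, which is fine for a table with a few dozen columns. As an alternative sketch, the same grouping can be done in a single pass with a defaultdict. This version works on plain dicts rather than GE's expectation objects, and group_expectations_by_column and the sample data are hypothetical.

```python
from collections import defaultdict


def group_expectations_by_column(expectations, column_names):
    """Group expectation dicts by their "column" kwarg in one pass."""
    grouped = defaultdict(list)
    for expectation in expectations:
        column = expectation["kwargs"].get("column")
        if column is not None:  # table-level expectations have no column
            grouped[column].append(expectation)
    # Every known column gets a key, even if it has no expectations yet.
    return {name: grouped.get(name, []) for name in column_names}


# Hypothetical sample data in the same shape as the suite .json.
expectations = [
    {"expectation_type": "expect_table_row_count_to_be_between",
     "kwargs": {"min_value": 1}},
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "npi"}},
    {"expectation_type": "expect_column_values_to_be_of_type",
     "kwargs": {"column": "npi", "type_": "int"}},
]

grouped = group_expectations_by_column(expectations, ["npi", "final_score"])
print({name: len(items) for name, items in grouped.items()})
```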

What column should we add next? Data types for each column would be useful, and if you specify that you would like column data typing during suite scaffold, GE can do a great job of parsing our data source in order to guess the data type of each column. It saves this in the expect_column_values_to_be_of_type expectation. Here we'll create another function for our data types column. Note: If you didn't specify this expectation in scaffolding, you can create these typing expectations manually for each column. Additional note: Depending on your needs, you may find yourself with an expect_column_values_to_be_in_type_list expectation instead, in which case you will need to modify this function slightly.

def _get_column_data_types(self, expectations_by_column):
    column_data_type_expectations = {}
    for k, v in expectations_by_column.items():
        expectation = [i for i in v if i["expectation_type"] == "expect_column_values_to_be_of_type"]
        if len(expectation) > 0:
            # Use the first matching expectation; "type_" may itself be missing (None).
            type_list = expectation[0].kwargs.get("type_")
            column_data_type_expectations[k] = type_list
        else:
            column_data_type_expectations[k] = None
    return column_data_type_expectations

This will take in our sorted expectations by column, loop through each column, and if it has an expect_column_values_to_be_of_type expectation, it will add the type_ from that expectation to a returned dictionary. We have some very basic error handling here in case we don't have an expectation for expect_column_values_to_be_of_type or in case the type_ in the expectation is empty.
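For suites that carry expect_column_values_to_be_in_type_list rather than a single type, one possible variation is sketched below. It is written against plain dicts so it runs without GE, it reads that expectation's type_list kwarg, and the choice to join the types into one comma-separated string is an assumption about how you might want the cell displayed.

```python
def get_column_data_types(expectations_by_column):
    """Map each column to a display string for its type, or None if unknown."""
    data_types = {}
    for column, expectations in expectations_by_column.items():
        data_types[column] = None
        for expectation in expectations:
            kind = expectation["expectation_type"]
            kwargs = expectation["kwargs"]
            if kind == "expect_column_values_to_be_of_type":
                data_types[column] = kwargs.get("type_")
                break
            if kind == "expect_column_values_to_be_in_type_list":
                # Show every allowed type in one cell, e.g. "int, int64".
                type_list = kwargs.get("type_list") or []
                data_types[column] = ", ".join(type_list) or None
                break
    return data_types


# Hypothetical sample input, shaped like the output of the grouping step.
expectations_by_column = {
    "npi": [{"expectation_type": "expect_column_values_to_be_of_type",
             "kwargs": {"column": "npi", "type_": "int"}}],
    "final_score": [{"expectation_type": "expect_column_values_to_be_in_type_list",
                     "kwargs": {"column": "final_score",
                                "type_list": ["float", "float64"]}}],
    "notes": [],
}

data_types = get_column_data_types(expectations_by_column)
print(data_types)
```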

Using our _get_column_data_types function, we can add the next column to our data dictionary. In our _render_data_dictionary function, we'll need to add a new column to our data_dictionary_df, and then a new column name to the "header_row" list in the RenderedTableContent object. I also know that I'll be adding more columns from our expectations, so I'll save our sorted expectations by column to a variable.

def _render_data_dictionary(self, expectations):
    data_dictionary_df = pd.DataFrame()
    data_dictionary_df["Column Name"] = [i for i in self._get_table_columns(expectations)]
    data_dictionary_df["Description"] = [i["description"] for i in self._get_table_columns(expectations).values()]

    # New: group the expectations once, then reuse the result for every derived column.
    expectations_by_column = self._sort_expectations_by_column(expectations)
    data_dictionary_df["Data Type"] = [i for i in self._get_column_data_types(expectations_by_column).values()]

    return RenderedTableContent(**{
        "content_block_type": "table",
        "header_row": ["Column Name", "Description", "Data Type"],  # new: third heading
        "header": "Data Dictionary",
        "table": data_dictionary_df.values,
        "styling": {
            "classes": ["col-12", "table-responsive"],
            "styles": {
                "margin-top": "20px",
                "margin-bottom": "20px"
            },
            "body": {
                "classes": ["table", "table-sm"]
            }
        },
    })

When we rebuild our documentation, we can see our new data type column!

[Image: final_dictionary]

See the gist for all of the code we've written so far.

There are many more columns we can add based on our expectation suite, like nullity, set values, and minimums and maximums. You can specify these columns when running suite scaffold or create these expectations manually. As you update your expectation suite, you can count on this data dictionary to stay current. If you add new expectations, like those for string length or regexes, you can add new columns to show those new attributes. And if you decide that you no longer want this data dictionary in your GE documentation, you can remove the plugin, or write a new one!
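As a starting point for those extra columns, here is a minimal sketch that derives nullability, allowed values, and min/max for one column from three common expectations. It uses plain dicts so it runs without GE. The find_kwargs and describe_column helpers and the output headings are hypothetical, and the nullability check is deliberately naive: it ignores the mostly argument, so a not-null expectation with a mostly value below 1 would still be reported as not nullable.

```python
def find_kwargs(expectations, expectation_type):
    """Return the kwargs of the first expectation of the given type, or None."""
    for expectation in expectations:
        if expectation["expectation_type"] == expectation_type:
            return expectation["kwargs"]
    return None


def describe_column(expectations):
    """Summarize one column's expectations as data dictionary cells."""
    not_null = find_kwargs(expectations, "expect_column_values_to_not_be_null")
    in_set = find_kwargs(expectations, "expect_column_values_to_be_in_set")
    between = find_kwargs(expectations, "expect_column_values_to_be_between")
    return {
        # Naive: any not-null expectation counts, regardless of "mostly".
        "Nullable": not_null is None,
        "Allowed Values": in_set.get("value_set") if in_set else None,
        "Min": between.get("min_value") if between else None,
        "Max": between.get("max_value") if between else None,
    }


# Hypothetical expectations for a single column.
final_score_expectations = [
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "final_score"}},
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "final_score", "min_value": 0, "max_value": 100}},
]

summary = describe_column(final_score_expectations)
print(summary)
```

Each key in the returned dict could become one more heading in "header_row", filled in the same way as the Data Type column.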

Final code for the plugin I built is here for reference but note that the blog doesn't cover everything!
