This post has been updated and works as of
. originally posted October 04, 2019
Recently, we've been thinking a lot about data dictionaries at Superconductive Health. Our clients can have a broad range of technical expertise, and some clients ask us for quick summaries of their data in data dictionary form - e.g. a summary of the metadata about each column in a table. With that frequent request in mind, I really appreciated Alex Jia and Michael Kaminsky's recent post, "You probably don't need a data dictionary." They made some salient points about the maintenance burden becoming a scalability issue for data dictionaries with the logical conclusion that if a data dictionary needs to be manually maintained, it's probably not worth the effort, as it will require maintaining multiple sources of truth for your data instead of relying on one source of truth. But what if a data dictionary were updated automatically based on your single data source of truth?
This seemed like a great candidate for a Great Expectations plugin. We're already using GE as our lone source of truth for data documentation, and with the amount of information already captured in GE, we wouldn't need to capture any new information - rather, we just need to provide a new view into the information that we already have. At the same time, this data dictionary feature might not provide critical functionality for every GE user, so rather than building it into the core project, building it out as a plugin seemed like the ideal way to proceed. Plugins are a simple way to add modular functionality to GE without altering the core. Running
to initialize a GE project automatically creates a plugins folder where you can drop bits of code to add or adjust functionality in GE. We can then adjust the project settings in our
file (also created upon project initialization) to tell GE to use our plugin instead of the base code.
Since our goal for building a data dictionary is to provide a new view into data that we already have, this falls into GE's rendering functionality - we don't need to process new information, but just to repackage the information we already have in a new way. For this example project, we're using the Overall Performance dataset which contains Merit-Based Incentive Payment System (MIPS) final scores and performance category scores for clinicians. You can export it as a CSV from the top of this page. As a reminder, this is what our basic expectation renderer looks like:
Our goal is to add a "Data Dictionary" section above the "Expectation Suite Overview" that will provide a table with some metadata about the columns of this CSV, like data type, nullability, possible values, etc. These columns will automatically populate based on our expectations, so as our data changes, the data dictionary will change without requiring manual maintenance. Let me say this one more time: Your data dictionaries will update automatically when your data documentation gets updated automatically - no manual intervention necessary. That solves the primary problem that Michael and Alex had identified with data dictionaries. Before we dive into the weeds, let's start by building a very simple plugin that adds a "Hello World" section to our already rendered documentation so that we can see what the plugin process is like. First, in our
file, under
then
, we will create a new key called
. Under this, we'll add a key called
, below that, a key called
, and below that we will add our
and
.
```yaml
local_site:
  class_name: SiteBuilder
  # set to false to hide how-to buttons in Data Docs
  show_how_to_buttons: true
  store_backend:
    class_name: TupleFilesystemStoreBackend
    base_directory: uncommitted/data_docs/local_site/

  site_section_builders:
    expectations:
      renderer:
        module_name: custom_data_docs.my_custom_renderer
        class_name: MyCustomExpectationSuitePageRenderer
```
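To make the role of the module and class names in that config a little more concrete, here is a minimal sketch of the general "load a class by name" pattern. This is not GE's actual loader; `load_class` is a hypothetical helper, and a standard-library module stands in for our plugin module so the snippet runs anywhere.

```python
import importlib


def load_class(module_name, class_name):
    """Import a module by its dotted name and return the named class from it.

    Hypothetical helper illustrating the general pattern only; GE's own
    config-driven instantiation is more involved than this.
    """
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in config: a stdlib module and class are used here in place of
# custom_data_docs.my_custom_renderer / MyCustomExpectationSuitePageRenderer.
renderer_config = {"module_name": "collections", "class_name": "OrderedDict"}

renderer_class = load_class(**renderer_config)
print(renderer_class.__name__)
```

The practical consequence is that the two names in the config have to match the plugin's file name and class name exactly, or the lookup fails.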
Next, we'll need to actually add the plugin. In the
directory, we'll create a file called
(to match the
from our
). At the top, we'll import the necessary modules from core GE.
```python
from great_expectations.render.renderer import (
    ExpectationSuitePageRenderer
)
from great_expectations.render.types import (
    RenderedDocumentContent,
    RenderedSectionContent,
    RenderedHeaderContent,
    RenderedTableContent,
)
```
Then we'll create a class that inherits from the pre-existing
, and overwrite the
function - we'll start by pasting in the original function. Then we'll add a call to
at the start of
. Finally we'll create the
function below. The goal here is to prepend our new "Hello World" section to our existing rendering functionality.
```python
import pandas as pd

from great_expectations.render.renderer import (
    ExpectationSuitePageRenderer
)
from great_expectations.render.types import (
    RenderedDocumentContent,
    RenderedSectionContent,
    RenderedHeaderContent,
    RenderedTableContent,
)


class MyCustomExpectationSuitePageRenderer(ExpectationSuitePageRenderer):
    def render(self, expectations):
        columns, ordered_columns = self._group_and_order_expectations_by_column(
            expectations
        )
        expectation_suite_name = expectations.expectation_suite_name

        overview_content_blocks = [
            self._render_data_dictionary_header(expectations),
            self._render_expectation_suite_header(),
            self._render_expectation_suite_info(expectations),
        ]

        table_level_expectations_content_block = self._render_table_level_expectations(
            columns
        )
        if table_level_expectations_content_block is not None:
            overview_content_blocks.append(table_level_expectations_content_block)

        asset_notes_content_block = self._render_expectation_suite_notes(expectations)
        if asset_notes_content_block is not None:
            overview_content_blocks.append(asset_notes_content_block)

        sections = [
            RenderedSectionContent(
                **{
                    "section_name": "Overview",
                    "content_blocks": overview_content_blocks,
                }
            )
        ]

        sections += [
            self._column_section_renderer.render(expectations=columns[column])
            for column in ordered_columns
            if column != "_nocolumn"
        ]
        return RenderedDocumentContent(
            **{
                "renderer_type": "ExpectationSuitePageRenderer",
                "page_title": expectation_suite_name,
                "expectation_suite_name": expectation_suite_name,
                "utm_medium": "expectation-suite-page",
                "sections": sections,
            }
        )

    def _render_data_dictionary_header(self, expectations):
        return RenderedHeaderContent(**{
            "content_block_type": "header",
            "header": "hello world",
            "styling": {
                "classes": ["col-12"],
                "header": {
                    "classes": ["alert", "alert-secondary"]
                }
            }
        })
```
Now, when we go to the command line and run
, we will rebuild our documentation with the new renderer. When we edited our
file, we changed the renderer for the
, so we'll want to open that one up. When opening data docs, we will start on the index page, so we will click into an expectation suite (as we are editing an expectation suite renderer). And when we do, lo and behold! A shiny, new "hello world" section prepends our existing renderer.
Being able to change the renderer's output is pretty exciting. But in its current format, our plugin isn't very useful. Let's change that by configuring it to display a table with our columns, and any column descriptions. In order to do this, we'll go into our
file. When we run
, GE profiles our datasource in order to provide us with a template for specifying our own expectations. These profiles are saved
files. Initially, these will be pretty generic, but as we fill in these profiles with our expectations, they will act as the source of truth from which our data dictionary will pull. Additionally, GE does a pretty good job of extracting data types, so we get that as a freebie. Let's open up
and take a look.
We can see that our
is divided into two main sections. There's a
section which contains some metadata for our datasource like column names and column descriptions (your descriptions might be empty), as well as a broader
section. Then there is our
section which is the bulk of this
file. It contains all of the expectations themselves. We'll dive deeper into this part later on in this post.
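As a rough picture of that two-part layout, here is a trimmed-down suite written as a Python dictionary. The column names and the description are invented for illustration, and a real suite will carry many more keys and expectations than this sketch does.

```python
import json

# Hypothetical, heavily trimmed suite. The column names ("npi", "final_score")
# and the description text are made up for this example.
suite = {
    "expectation_suite_name": "example_suite",
    "meta": {
        # Column-level metadata: this is where the descriptions live.
        "columns": {
            "npi": {"description": "National Provider Identifier"},
            "final_score": {"description": ""},
        },
    },
    # The expectations sit in one flat list, not grouped by column.
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_be_of_type",
            "kwargs": {"column": "npi", "type_": "int64"},
        },
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "final_score"},
        },
    ],
}

column_names = list(suite["meta"]["columns"])
expectation_types = [e["expectation_type"] for e in suite["expectations"]]

print(json.dumps(suite["meta"], indent=2))
```

Everything the data dictionary needs comes from one of these two places: names and descriptions from the metadata, and everything else from the kwargs of individual expectations.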
For now, let's try extracting the column names and column descriptions - these will make up the first two columns of our data dictionary. If we look back at the code for the "hello world" renderer that we built, we have most of the building blocks of our renderer in our
list. So far, all we have is a header - let's try adding the empty table that will be our data dictionary. To do that, we'll add a new
function. This will return a new RenderedTableContent object that will get added to the
list in our
function.
```python
def render(self, expectations):
    columns, ordered_columns = self._group_and_order_expectations_by_column(
        expectations
    )
    expectation_suite_name = expectations.expectation_suite_name

    overview_content_blocks = [
        self._render_data_dictionary_header(expectations),
        self._render_data_dictionary(expectations),
        self._render_expectation_suite_header(),
        self._render_expectation_suite_info(expectations),
    ]

    table_level_expectations_content_block = self._render_table_level_expectations(
        columns
    )
    if table_level_expectations_content_block is not None:
        overview_content_blocks.append(table_level_expectations_content_block)

    asset_notes_content_block = self._render_expectation_suite_notes(expectations)
    if asset_notes_content_block is not None:
        overview_content_blocks.append(asset_notes_content_block)

    sections = [
        RenderedSectionContent(
            **{
                "section_name": "Overview",
                "content_blocks": overview_content_blocks,
            }
        )
    ]

    sections += [
        self._column_section_renderer.render(expectations=columns[column])
        for column in ordered_columns
        if column != "_nocolumn"
    ]
    return RenderedDocumentContent(
        **{
            "renderer_type": "ExpectationSuitePageRenderer",
            "page_title": expectation_suite_name,
            "expectation_suite_name": expectation_suite_name,
            "utm_medium": "expectation-suite-page",
            "sections": sections,
        }
    )

def _render_data_dictionary(self, expectations):
    return RenderedTableContent(**{
        "content_block_type": "table",
        "header_row": ["Column Name", "Description"],
        "header": "Data Dictionary",
        "table": [],
        "styling": {
            "classes": ["col-12", "table-responsive"],
            "styles": {
                "margin-top": "20px",
                "margin-bottom": "20px"
            },
            "body": {
                "classes": ["table", "table-sm"]
            }
        },
    })
```
For now, we'll leave the
an empty list, and we'll specify our first two columns in the
("Column Name" and "Description"). Now when we rebuild our documentation (by running
on the command line), you'll see the start of our new table, with a header and a header row.
Now what we'll do is replace the empty list that is our table with a pandas DataFrame. Then we can add each column to the dataframe as we parse it from the
, and slowly populate our data dictionary. Let's first build a function to get the table columns and descriptions. Because of the
section in our
file, this is a simple, one-line function.
```python
def _get_table_columns(self, expectations):
    return expectations.meta.get("columns")
```
Next, we'll create the pandas DataFrame, and add our columns for Column Name and Description. Make sure that you're importing pandas at the top of the module. Then we'll use a list comprehension in conjunction with our
function to populate each of these columns. Also, let's change out the text in our header from "hello world" to "Data Dictionary." Lastly, be sure to replace your empty list in
with the values of your new DataFrame!
```python
def _render_data_dictionary(self, expectations):
    data_dictionary_df = pd.DataFrame()
    if self._get_table_columns(expectations):
        data_dictionary_df["Column Name"] = [i for i in self._get_table_columns(expectations)]
        data_dictionary_df["Description"] = [i["description"] for i in self._get_table_columns(expectations).values()]

    return RenderedTableContent(**{
        "content_block_type": "table",
        "header_row": ["Column Name", "Description"],
        "header": "Data Dictionary",
        "table": data_dictionary_df.values,
        "styling": {
            "classes": ["col-12", "table-responsive"],
            "styles": {
                "margin-top": "20px",
                "margin-bottom": "20px"
            },
            "body": {
                "classes": ["table", "table-sm"]
            }
        },
    })
```
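One thing to watch for: indexing each column's metadata with `["description"]` raises a `KeyError` if a column has no description key at all. Here is a small, pandas-free sketch of a more forgiving way to build the same two-column rows. `build_rows` is a hypothetical helper and the column names are invented; it is a variation on the approach above, not part of the plugin.

```python
def build_rows(meta):
    """Return [column name, description] rows from a suite's meta dictionary.

    Hypothetical helper: tolerates a missing "columns" section and columns
    that have no "description" key.
    """
    columns = meta.get("columns") or {}
    return [[name, info.get("description", "")] for name, info in columns.items()]


# Invented example metadata; "final_score" deliberately has no description.
meta = {
    "columns": {
        "npi": {"description": "Provider ID"},
        "final_score": {},
    }
}

rows = build_rows(meta)
for row in rows:
    print(row)
```

A list of row lists like this has the same shape as the DataFrame's values, so the same idea should carry over to the pandas version.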
When we rebuild our documentation, we can see our first two columns! At this point, you may have noticed that your "Description" column is empty. That's because the descriptions for each column in the "meta" section of our
are empty. And that is the idea here - as long as we keep our expectations up-to-date in the form of these
files, we can count on our data dictionary to be up-to-date. So if we add some descriptions in the
file, we'll start to fill in our data dictionary.
Now we can start parsing our column level expectations in order to populate some more columns in our data dictionary. We're going to be reaching into the
part of our
file in order to do this. If you've perused the
, you might have noticed that all of our expectations are at the same level - they are not organized under individual column sections. So first things first, we'll define a function that organizes our expectations by column. This will take in our expectations and use a filter to populate a dictionary where the keys are the names of our columns, and the values are a list of our expectations.
```python
def _sort_expectations_by_column(self, expectations):
    expectations_by_column = {}
    expectations_dictionary = expectations.expectations
    column_names = list(self._get_table_columns(expectations).keys())
    for column in column_names:
        expectations_by_column[column] = list(filter(
            lambda x: x.kwargs.get("column") == column,
            expectations_dictionary))
    return expectations_by_column
```
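The function above walks the full expectation list once per column. For wide tables, a single-pass grouping may be preferable. Below is a minimal sketch of that alternative; `FakeExpectation` is a hypothetical stand-in for GE's expectation objects (it only mimics the `expectation_type` and `kwargs` attributes), and the column names are invented.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FakeExpectation:
    """Hypothetical stand-in exposing only the attributes this sketch uses."""
    expectation_type: str
    kwargs: dict = field(default_factory=dict)


def group_by_column(expectations, column_names):
    """Group expectations by their "column" kwarg in one pass over the list."""
    grouped = defaultdict(list)
    for expectation in expectations:
        column = expectation.kwargs.get("column")
        if column in column_names:
            grouped[column].append(expectation)
    # Return every known column in its original order, including columns
    # that have no expectations (they map to an empty list).
    return {name: grouped[name] for name in column_names}


expectations = [
    FakeExpectation("expect_column_values_to_not_be_null", {"column": "npi"}),
    FakeExpectation("expect_table_row_count_to_be_between", {"min_value": 1}),
    FakeExpectation("expect_column_values_to_be_of_type", {"column": "npi", "type_": "int64"}),
]

by_column = group_by_column(expectations, ["npi", "final_score"])
print({name: len(items) for name, items in by_column.items()})
```

As in the original function, table-level expectations (those without a column kwarg) are left out of the result.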
What column should we add next? Data types for each column would be useful, and if you specify that you would like column data typing during
GE can do a great job of parsing our data source in order to guess the data type of each column. It saves this in the
expectation. Here we'll create another function for our data types column. Note: If you didn't specify this expectation in scaffolding, you can create these typing expectations manually for each column. Additional note: Depending on your needs, you may find yourself with an
expectation and you may need to modify this function slightly.
```python
def _get_column_data_types(self, expectations_by_column):
    column_data_type_expectations = {}
    for k, v in expectations_by_column.items():
        expectation = [i for i in v if i["expectation_type"] == "expect_column_values_to_be_of_type"]
        if len(expectation) > 0:
            type_list = expectation[0].kwargs.get("type_")
            column_data_type_expectations[k] = type_list
        else:
            column_data_type_expectations[k] = None
    return column_data_type_expectations
```
This will take in our sorted expectations by column, loop through each column, and if it has an
expectation, it will add the
from that expectation to a returned dictionary. We have some very basic error handling here in case we don't have an expectation for
or in case the
in the expectation is empty.
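If your suite records types as a list of acceptable types rather than a single type, the lookup needs one more branch. The sketch below assumes the list-style expectation is `expect_column_values_to_be_in_type_list`, whose kwarg is `type_list`; that is our assumption about which variant you may run into. `FakeExpectation` is again a hypothetical stand-in for GE's expectation objects.

```python
from dataclasses import dataclass, field


@dataclass
class FakeExpectation:
    """Hypothetical stand-in exposing only the attributes this sketch uses."""
    expectation_type: str
    kwargs: dict = field(default_factory=dict)


def get_column_data_types(expectations_by_column):
    """Map each column to a type string, or None if no type expectation exists."""
    data_types = {}
    for column, expectations in expectations_by_column.items():
        data_types[column] = None
        for expectation in expectations:
            if expectation.expectation_type == "expect_column_values_to_be_of_type":
                data_types[column] = expectation.kwargs.get("type_")
                break
            if expectation.expectation_type == "expect_column_values_to_be_in_type_list":
                # Assumed kwarg name: "type_list". Join the options for display.
                type_list = expectation.kwargs.get("type_list") or []
                data_types[column] = ", ".join(type_list) or None
                break
    return data_types


example = {
    "npi": [FakeExpectation("expect_column_values_to_be_of_type", {"column": "npi", "type_": "int64"})],
    "final_score": [
        FakeExpectation(
            "expect_column_values_to_be_in_type_list",
            {"column": "final_score", "type_list": ["float", "float64"]},
        )
    ],
    "notes": [],
}

data_types = get_column_data_types(example)
print(data_types)
```

Joining the list into one string is just a display choice for the table cell; you could equally show only the first entry.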
Using our
function, we can add the next column to our data dictionary. In our
function, we'll need to add a new column to our
, and then a new column name to the
list in the
object. I also know that I'll be adding more columns from our expectations, so I'll save our sorted expectations by column to a variable.
```python
def _render_data_dictionary(self, expectations):
    data_dictionary_df = pd.DataFrame()
    data_dictionary_df["Column Name"] = [i for i in self._get_table_columns(expectations)]
    data_dictionary_df["Description"] = [i["description"] for i in self._get_table_columns(expectations).values()]

    expectations_by_column = self._sort_expectations_by_column(expectations)
    data_dictionary_df["Data Type"] = [i for i in self._get_column_data_types(expectations_by_column).values()]

    return RenderedTableContent(**{
        "content_block_type": "table",
        "header_row": ["Column Name", "Description", "Data Type"],
        "header": "Data Dictionary",
        "table": data_dictionary_df.values,
        "styling": {
            "classes": ["col-12", "table-responsive"],
            "styles": {
                "margin-top": "20px",
                "margin-bottom": "20px"
            },
            "body": {
                "classes": ["table", "table-sm"]
            }
        },
    })
```
When we rebuild our documentation, we can see our new data type column!
See the gist for all of the code we've written so far.
There are many more columns we can add based on our expectation suite, such as nullity, set values, and minimums and maximums. You can specify these columns when running
or create these expectations manually. As you update your expectation suite, you can count on this data dictionary to stay current. If you add new expectations, like those for string length or regexes, you can add new columns to show those new attributes. And if you decide that you no longer want this data dictionary in your GE documentation, you can remove the plugin, or write a new one!
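As one example of such an extra column, here is a minimal sketch of a nullability lookup that follows the same pattern as the data type function. It assumes a column counts as non-nullable whenever it has an `expect_column_values_to_not_be_null` expectation, which is a simplification (for instance, it ignores any tolerance set on that expectation). `FakeExpectation` is a hypothetical stand-in for GE's expectation objects, and the column names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class FakeExpectation:
    """Hypothetical stand-in exposing only the attributes this sketch uses."""
    expectation_type: str
    kwargs: dict = field(default_factory=dict)


def get_column_nullability(expectations_by_column):
    """Map each column to True if nulls are allowed, False otherwise.

    Simplifying assumption: any not-null expectation makes the column
    non-nullable, whatever its other kwargs say.
    """
    return {
        column: not any(
            e.expectation_type == "expect_column_values_to_not_be_null"
            for e in expectations
        )
        for column, expectations in expectations_by_column.items()
    }


example = {
    "npi": [FakeExpectation("expect_column_values_to_not_be_null", {"column": "npi"})],
    "final_score": [
        FakeExpectation("expect_column_values_to_be_of_type", {"column": "final_score", "type_": "float64"})
    ],
}

nullability = get_column_nullability(example)
print(nullability)
```

The resulting values could then go into a new "Nullable" column of the DataFrame, with a matching entry added to the header row.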
Final code for the plugin I built is here for reference but note that the blog doesn't cover everything!