Great Expectations is getting better in the 0.9.0 release, in large part because of much more flexibility in how to interact with Datasources. The old system of 'datasource/generator/generator_asset' left many people feeling trapped.
We realized that there was just too much friction in matching the way data flowed through different pipelines...and it was too hard to reuse expectation suites (but more on that another time).
That realization led us to revise the Datasource concept completely, so it is no longer a one-size-fits-all approach. In this blog post, you'll get to choose among a variety of paths corresponding to different ways that people use Great Expectations. Reading through--or even following along!--should help build familiarity with Datasources and Batch Kwargs, and help you understand when and how you might use a generator.
Oh, and there's a TL;DR that inspired the whole thing:
Preparation
I'll use data from the "titanic" dataset, available here, and have saved the file to the path /opt/data/titanic/Titanic.csv. (But it should be easy to see where to modify paths if that's not where you save it.)
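If you want a quick sanity check that your copy of the file loads cleanly before starting, a one-liner like this (using the path above) does the trick:

import pandas as pd

# Quick sanity check: the file should load and show the familiar titanic columns
pd.read_csv("/opt/data/titanic/Titanic.csv", index_col=0).head()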
I assume you have installed Great Expectations using pip install great_expectations, and have a notebook available to run the code blocks. If you're having trouble following along with any of this but want to, hit me up on Slack!
With the preliminaries out of the way, we'll start as all good Choose-Your-Own-Adventure stories ought...
The Beginning
You awake to find yourself at a keyboard. You remember that you have recently had a "data horror story" -- another team changed the schema for data they were sending you, and it subtly corrupted a dashboard. It took two weeks for anyone to notice, and then it created a cancel-your-weekend fire drill for the team.
Determined not to let it happen again, you are excited to implement GREAT EXPECTATIONS to crush the pipeline debt that enabled the problem. Let's fire up a GE environment.
import great_expectations as ge
print(ge.__version__)

import os
import datetime

project_dir = "./ge_context/choose_your_adventure"
os.makedirs(project_dir, exist_ok=True)  # exist_ok lets you re-run this cell safely
context = ge.data_context.DataContext.create(project_dir)

pandas = context.add_datasource(
    "pandas",
    class_name="PandasDatasource",
)
With the datasource in hand, it's time to decide how you want to proceed...
if you want to create expectations for a new dataset, go to Building a Suite
if you want to use an existing expectation suite, go to Beginning Validation
Building a Suite
Determined to create excellent expectations, you realize you need a data sample to get started. How do you want to get your sample data?
if you want to be presented with options from your datasource, go to Inspecting a Datasource
if you want to specify the data to use yourself, go to Describing Your Batch
Beginning Validation
You are ready to validate data...but just what data? Let's find out:
if your data is already loaded into your compute environment and you have a reference to it, go to Data Already Loaded in Compute. That could include an in-memory Pandas DataFrame, a Spark DataFrame, or a table in a database.
if you would like Great Expectations to help you load data from a storage environment to a compute environment, choose Load Data
Data Already Loaded in Compute
You proceed ahead and blow the dust away from the sign above the doorframe to see that the room is labeled "compute validation". If you are using a SQL database, your compute and your storage are often commingled concepts. But not necessarily -- some of the most innovative database technologies, such as Snowflake, draw that distinction explicitly. And the reverse is also true: some of the most innovative technologies in distributed compute, such as Spark + Delta, provide SQL-like interfaces and database-like guarantees.
It can be useful to be able to name the data assets that you're working with, but you don't have to. (If you'd rather validate an in-memory DataFrame directly, there's a sketch of that after these choices.)
If you want to name data assets, proceed to Storing BatchKwargs in Configuration
Otherwise, proceed to Describing Your Batch
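Here's that sketch: a minimal example of validating a DataFrame you already have in memory, assuming you've already created and saved the "adventure" suite from the Building a Suite path, and that context is the DataContext from the beginning. Treat it as one convenient route, not the only one.

import pandas as pd
import great_expectations as ge

# Stand-in for data you already have loaded in your compute environment
df = pd.read_csv("/opt/data/titanic/Titanic.csv", index_col=0)

# Assumes the "adventure" suite was created and saved on an earlier run
suite = context.get_expectation_suite("adventure")

ge_df = ge.from_pandas(df)  # wrap the DataFrame so it can run expectations
results = ge_df.validate(expectation_suite=suite)
print(results)  # overall success flag plus per-expectation results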
Storing BatchKwargs in Configuration
'Streamlined code' and 'all data-related references in configuration' are your middle names, you remind yourself as you gear up for this next challenge. You add the following to the pandas datasource in your great_expectations.yml.
generators:
  manual:
    class_name: ManualGenerator
    assets:
      titanic:
        path: /opt/data/titanic/Titanic.csv
        reader_method: read_csv
        reader_options:
          index_col: 0
Now, you can reload the context and retrieve those stored (reusable) batch_kwargs using the configured asset name. (We call the name a batch_parameter if we want to get precise--it's an input to the generator that is used to build batch_kwargs.)
context = ge.data_context.DataContext(os.path.join(project_dir, "great_expectations"))
batch_kwargs = context.build_batch_kwargs("pandas", "manual", "titanic")

# What expectation suite shall we use? Why, the "adventure" suite of course:
expectation_suite_name = "adventure"

# Demo Mode? Uncomment the below first to use an empty validation suite
# suite = context.create_expectation_suite(expectation_suite_name)
Now, continue on to Getting a Batch of Data
Load Data
You step towards the well of Great Expectations' data-loading functionality, and immediately realize you face another choice. Do you already know how to describe the data you want to load using Batch Kwargs that your configured datasource can use?
If so, proceed directly to Describing Your Batch
If not, peer deeper into the well...
Peering into the well, you see the opportunity to fetch the Batch Kwargs from the deep using parameters to describe what you want; you could describe a filepath and a number of rows, for example.
Can you describe your data using Batch Parameters of that sort?
If yes, we'll provide a data asset name, but will LIMIT it to a smaller number of rows. Proceed to Adding a Generator
If no, proceed to THE WELL
Inspecting a Datasource
Unsure what data your datasource and generator, working together, can provide (perhaps you're in the INIT flow?), you hope to see their true might using the AWESOME POWER of the get_available_data_asset_names command. Great Expectations will use a generator to inspect and understand the data available in your environment.
Continue to Adding a Generator
Adding a Generator
subdir_reader = context.add_generator("pandas", "subdir_reader", class_name="SubdirReaderGenerator")
If you are using the generator to inspect a datasource, proceed to Listing Assets.
If you are using the generator to build batch kwargs from parameters, continue to Building Batch Kwargs With Parameters
Building Batch Kwargs with Parameters
You know you want to use "titanic" data, but you want GE to help you get the specific data into your compute environment by finding a batch of the data and reading only the first 10 rows. You can declare your wish using semantic batch parameters that will work with any Great Expectations datasource (which brings compute and storage together).
batch_kwargs = context.build_batch_kwargs(
    datasource="pandas",
    generator="subdir_reader",
    name="titanic",
    limit=10
)
batch_kwargs

# What expectation suite shall we use? Why, the "adventure" suite of course:
expectation_suite_name = "adventure"

# Demo Mode? Uncomment the below first to use an empty validation suite
# suite = context.create_expectation_suite(expectation_suite_name)
With your batch_kwargs in hand, proceed to Getting a Batch of Data
Listing Assets
With the power of the generator, it is possible to list available data assets as they are identified and defined by the generator, rather than only naming them manually.
Your new BatchKwargsGenerator knows how to inspect the pandas datasource and provide batch_kwargs based on the files it finds.
To start, you need to list available data asset names.
context.get_available_data_asset_names()
(You can use the jupyter_ux module to help format this output if needed, or wrap it into a script or node in a bigger DAG).
The context observes that the 'pandas' datasource's 'subdir_reader' generator knows of the listed names, each with the listed type.
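If you'd rather format that output yourself than lean on jupyter_ux, a rough sketch like the following works; note that the exact shape of the returned dictionary has shifted across versions, so adjust the traversal to what you actually see:

# Walk the nested mapping of datasource -> generator -> asset names.
# Illustrative only: the structure of this return value varies by GE version.
available = context.get_available_data_asset_names()
for datasource_name, generators in available.items():
    for generator_name, asset_names in generators.items():
        print("{} / {}:".format(datasource_name, generator_name))
        for asset in asset_names:
            print("  - {}".format(asset))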
One of those names is what we want (ok, in the demo there's probably only one showing up for you...). Then, having selected a data asset to use, as defined by the generator you configured, the generator can also build batch_kwargs describing a batch of data from that asset. While batch_kwargs define only one batch at a time, it is possible that that batch represents the entire dataset (for a static dataset). Further, different BatchKwargsGenerators can provide different options for how to build batch_kwargs, such as allowing you to specify a limit.
selected_data_asset_name = "Titanic"

batch_kwargs = context.build_batch_kwargs("pandas", "subdir_reader", selected_data_asset_name)
batch_kwargs

# What expectation suite shall we use? Why, the "adventure" suite of course:
expectation_suite_name = "adventure"

# Demo Mode? Uncomment the below first to use an empty validation suite
# suite = context.create_expectation_suite(expectation_suite_name)
Continue to Getting a Batch of Data
Describing Your Batch
You scoff at those who don't know the data warehouse by heart; you laugh in the face of misspecified table names. You know just the data you want to use. Bring on the challenge. You decide to tackle the LEGENDARY titanic dataset by...typing the path.
# To describe the data, you need to specify a path (a query or table name would
# do for a SQL datasource, but why complicate things yet?)
batch_kwargs = {
    "path": "/opt/data/titanic/Titanic.csv"
}

# If you want to specify additional options, for example telling pandas to use
# the first column as an index, you can add reader_options.
batch_kwargs = {
    "path": "/opt/data/titanic/Titanic.csv",
    "reader_options": {
        "index_col": 0
    }
}

# With this description in hand, the time has come to create an expectation
# suite to hold the new expectations.
expectation_suite_name = "adventure"
# Demo Mode? Uncomment the below first to use an empty validation suite
# suite = context.create_expectation_suite(expectation_suite_name)
With your batch_kwargs in hand, proceed to Getting a Batch of Data
Getting a Batch of Data
With specific batch_kwargs in hand, you are ready to obtain your data and crush some pipeline debt. Continue through the gem-encrusted door to find your data...
batch = context.get_batch(batch_kwargs=batch_kwargs, expectation_suite_name=expectation_suite_name)
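Before moving on, it can be reassuring to peek at what you received. With a pandas-backed datasource the batch behaves like a DataFrame (assuming that backend; other datasources return other batch types):

# Pandas-backed batches behave like DataFrames, so the usual inspection works
batch.head()
print(len(batch))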
If you are in the process of building a suite, continue on to Adding and Saving Expectations
If you are validating, proceed to Validate Your Batch
Adding and Saving Expectations
With sample data in hand, it's time to add some expectations...
There is no "right" set of expectations, but if you're following along with the titanic data, you might consider these...
batch.expect_column_to_exist("Survived")
batch.expect_column_values_to_be_between("Age", 18, 80, mostly=0.9)
batch.expect_column_values_to_not_be_null("Name")
batch.expect_column_values_to_be_in_set("Sex", ["male", "female"])
batch.expect_column_values_to_be_in_set("PClass", ["1st", "2nd", "3rd"], mostly=0.999)

# Don't forget to save your new suite!
batch.save_expectation_suite()
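As you iterate, note that each expect_* call evaluates immediately against the batch and returns a result, so you can check your work as you go. (The exact result type varies a bit by version, but printing it shows the success flag and summary statistics.)

# Each expectation evaluates immediately; inspect the result as you iterate
single_result = batch.expect_column_values_to_be_in_set("Sex", ["male", "female"])
print(single_result)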
Congratulations, friend, you have succeeded! The quest is yours, along with the rights that come with its completion. Save your suite, and return to the beginning when you are ready to validate data--or to explore a different path.
Validate Your Batch
The time has come to end your quest and validate your batch of data.
result = context.run_validation_operator("action_list_operator", [batch])
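What came back? The structure of the operator's result has varied across Great Expectations versions, so the honest first move is simply to inspect it:

# Exploratory: inspect the operator result to find the overall success flag
# and per-batch details (the exact structure varies by GE version).
print(type(result))
print(result)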
Hail! You have completed your quest. Please return to the beginning if you would like to explore a different path.
THE WELL
You strive hard for your data. You peer deeper and deeper into it. Until...YOU FALL IN. Sorry, you don't have enough information to use GE yet--you'll need to understand how your data is stored or what you want to validate!
Cheers!