backgroundImage

ID/PK helps you find your problem data

Specify a row identifier or primary key & get back the rows that fail your Expectations

Will Shin
February 14, 2023
Screenshot of an example use of ID/PK as rendered in Data Docs
Row identifiers, if specified, now appear in Data Docs

If you’ve ever wished you could identify the exact rows that failed your Expectation: now you can!

You can now specify a row identifier or primary key (ID/PK) of a table and identify the rows that fail an Expectation by using a new key, unexpected_index_column_names, in the result_format parameter. This is a feature that a lot of people have expressed interest in, so we’re very happy to be able to show it off now.

This ID/PK work is a perfect example of the power of the open source community, and Great Expectations’ in particular.

It was a community member, KentonParton, who made the initial feature request for ID/PK on GitHub:

original ID/PK feature request

Original situation

Technically, this was possible in a limited way using Pandas’ built-in unexpected_index_list, but this had two major shortcomings. The first was that it could only use the default Pandas index; the second, more significant drawback was that it only worked if you were using Pandas.

You could also get some minimum information from Checkpoints. A Checkpoint would output the list of unexpected values and the number of times each unexpected value occurred. This told you what was wrong but not where it went wrong.

ID/PK development

This feature request proposed expanding that capability to allow users to specify their own index column and use any supported execution engine.

Shortly after KentonParton’s original post, OmarSultan85 chimed in to reinforce the use case for this feature:

ID/PK followup comment

Hearing from additional users that they want a given feature is a huge factor when the GX team is deciding whether to take action on a request and how to prioritize it.

There was a robust community discussion as the GX team scoped out this feature too. Many of the naming ideas that users suggested ultimately made their way into the final feature.

Community member Aidan Fennessy (abekfenn) took on the initial work of implementing this feature. 

After Aidan began the Pandas implementation, he passed the baton back to the GX team. I completed the Pandas, Spark, and SQLAlchemy implementations just a few weeks ago.

Today

Here’s how to start using ID/PK in your GX deployment.  For the sake of illustration, we will use the following example table throughout this blog post.

ID/PK example table

First, open up your result_format for the Checkpoint you want ID/PK in.

To specify the column that will be your row identifier, add an unexpected_index_column_names parameter. For the value, specify the list of column name(s) as a list of strings.

1result_format_dict: dict = {
2 "result_format": {
3 "result_format": "COMPLETE",
4 "unexpected_index_column_names": ["event_id"],
5 }
6}

You can pass the updated result_format back into your Checkpoint using run_checkpoint’s result_format parameter:

1results: CheckpointResult = context.run_checkpoint(
2 checkpoint_name="my_checkpoint",
3 result_format=result_format_dict
4)

 or by adding it to the Checkpoint configuration itself as a runtime_configuration

1name: my_checkpoint
2config_version: 1.0
3template_name:
4module_name: great_expectations.checkpoint
5class_name: Checkpoint
6run_name_template: '%Y-%M-foo-bar-template'
7expectation_suite_name: visitors_exp
8batch_request: {}
9action_list:
10 - name: store_validation_result
11 action:
12 class_name: StoreValidationResultAction
13 - name: store_evaluation_params
14 action:
15 class_name: StoreEvaluationParametersAction
16 - name: update_data_docs
17 action:
18 class_name: UpdateDataDocsAction
19evaluation_parameters: {}
20runtime_configuration: {}
21validations:
22 - batch_request:
23 datasource_name: my_datasource
24 data_connector_name: my_spark_data_connector
25 data_asset_name: visits.csv
26profilers: []
27ge_cloud_id:
28expectation_suite_ge_cloud_id:

Now, when you run the Expectation, its Validation Result will output a list of dictionaries in unexpected_index_list.  There will be one dictionary for each row that failed the Expectation, and each dictionary will contain the row’s identifier and the unexpected column value (the value that failed the Expectation).

Pandas result

Now that you know where your failed data is, you might want to go in and correct it or move it out of your production table. GX facilitates that by providing unexpected_index_query as another output of unexpected_index_column_names.

For SQL, the output will contain a query that can be used directly against the DB backend.

sql output query

For Spark, the output will include a filter condition on the DataFrame.

Spark output

And for Pandas, the query will include the indices.

Pandas query

Currently, the unexpected_query returns results at the COMPLETE output level. If you would like to suppress the query, then the return_unexpected_index_query parameter can be set to False.

1result_format_dict: dict = {
2 "result_format": {
3 "result_format": "COMPLETE",
4 "unexpected_index_column_names": ["event_id"],
5 "return_unexpected_index_query": False,
6 }
7}

Specifying multiple indices is simply a matter of adding the additional index column name to the unexpected_index_column_names parameter:

1result_format_dict: dict = {
2 "result_format": {
3 "result_format": "COMPLETE",
4 "unexpected_index_column_names": ["event_id", "visit_id"],
5 }
6}

The resulting output will contain indices for both event_id and visit_id.

Two indices output

In Pandas, we can have named indices. You can use ID/PK with those, too!

Pandas table

1result_format_dict: dict = {
2 "result_format": {
3 "result_format": "COMPLETE",
4 "unexpected_index_column_names": ["event_id"],
5 }
6}

Pandas result

When you build DataDocs,  the ID/PK information will be added to the Unexpected Counts table. The row identifier column will be an additional column on the Unexpected Counts table, with the list of indices shown as the columns below. This is what it looks like in our example, which has the row identifier column event_id:

Updated unexpected counts table

Below the table you’ll also see a dropdown menu labeled “To retrieve all unexpected values…”. 

When you click this, the table will display a way for you to retrieve the full list of unexpected values. This query will look slightly different for each execution engine.

The SQL query looks like:

Retrieve all unexpected values sql query

The Pandas list of indices looks like:

Retrieve all unexpected values list of Pandas indices

And the Spark filter looks like:

Retrieve all unexpected values Spark filter

And that’s ID/PK! You can watch my live demo of this feature on our YouTube.

If you’d like to view the full scripts used in this example, you can find them on GitHub.


This blog was written for GX version 0.15.50.

Search our blog for the latest on data quality.


©2024 Great Expectations. All Rights Reserved.