The latest Great Expectations release has first-class support for FileDataAssets, with a new set of file-level expectations. Check them out!
Recently, a colleague posted the following message in one of our internal Slack channels:
Imagine (it’s easy if you try) that you have a CSV file, whose header contains 269 columns, separated by the “|” sign. The following lines in this file are comprised of data columns, one data token per header, separated by the “|” sign. Now imagine that one of those data lines contains 272 data values, separated by the “|” sign. […] If you are wondering about why these numbers are so particular, it is because this example did not need to be imagined. It is real and just happened with [redacted to protect the innocent] data set.
It is easy if you try, isn’t it? Is it so much to ask that a CSV conform to its own spec? Now, of course, most of you are immediately thinking about how fun it would be to search for should-be-escaped pipes, how exciting it would be to guess the errors by building column models, or what a hero you would be after doing something even more exotic to coax the meaning from this malformed CSV.
I get it, I go there too. But…are we really asking so much here?
We at Great Expectations (GE) have seen and heard about this problem so often that, thanks to the work of @anhollis and others, file-level expectations of just this type are now first-class citizens. The FileDataAsset type is easy to connect to (you guessed it) files. And in keeping with the GE spirit of verbosity, you’ll now find these easy-to-understand expectations available right out of the box:
    expect_file_to_exist
    expect_file_size_to_be_between
    expect_file_hash_to_equal
    expect_file_to_have_valid_table_header
    expect_file_to_be_valid_json
    expect_file_line_regex_match_count_to_be_between
    expect_file_line_regex_match_count_to_equal
With those expectations in place, we can easily bring more components of our pipeline under test starting with this release of GE.
Let’s replay my colleague’s story using the new FileDataAsset type.
expect_file_to_exist is a pretty minimal place to start, but it could be useful if all we know is that we should get data in, say, a daily drop and we want to alert if the data never shows up (since many pipelines would just happily process all zero files in the staging area).
We believe that testing for simple things that cause silent failures later — for example no data loading because it was never delivered — is an important pattern in Great Expectations. So the most basic config is really simple:
    {
      "data_asset_name": "daily_delivery",
      "data_asset_type": "FileDataAsset",
      "meta": {
        "great_expectations.__version__": "0.5.1"
      },
      "expectations": [
        {
          "expectation_type": "expect_file_to_exist",
          "kwargs": {
            "filepath": {
              "$PARAMETER": "filename"
            }
          }
        }
      ]
    }
Notice that we’re using filename as an evaluation parameter.*
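To make the evaluation-parameter idea concrete, here is a minimal standard-library sketch of how a `$PARAMETER` placeholder might be resolved at validation time before a file-existence check runs. It does not use the Great Expectations API; `resolve_kwargs` and `file_exists_check` are hypothetical helpers invented for illustration, and the result dictionary only imitates the general shape of an expectation result.

```python
import os
import tempfile


def resolve_kwargs(kwargs, evaluation_parameters):
    """Replace {"$PARAMETER": name} placeholders with runtime values (hypothetical helper)."""
    resolved = {}
    for key, value in kwargs.items():
        if isinstance(value, dict) and "$PARAMETER" in value:
            resolved[key] = evaluation_parameters[value["$PARAMETER"]]
        else:
            resolved[key] = value
    return resolved


def file_exists_check(filepath):
    """Imitate an expectation result for a file-existence check (hypothetical helper)."""
    return {"success": os.path.isfile(filepath)}


# The kwargs from the config, with the path left as a placeholder.
config_kwargs = {"filepath": {"$PARAMETER": "filename"}}

# A delivery that arrived: the temporary file exists while the block is open.
with tempfile.NamedTemporaryFile(suffix=".csv") as delivered:
    kwargs = resolve_kwargs(config_kwargs, {"filename": delivered.name})
    present = file_exists_check(**kwargs)

# A delivery that never showed up: this path was never created.
missing_path = delivered.name + ".missing"
missing = file_exists_check(**resolve_kwargs(config_kwargs, {"filename": missing_path}))
```

Keeping the path out of the stored config means the same expectation can be reused for every daily drop, with only the filename changing per run.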
Clearly, though, we have higher expectations than just those: let’s use expect_file_line_regex_match_count_to_be_between to ensure we have the right number of values on each line. Again, a common pattern in Great Expectations is to start broad, then refine expectations using annotations that describe the data. Our first expectation config might look something like this:
    {
      "data_asset_name": "daily_delivery",
      "data_asset_type": "FileDataAsset",
      "meta": {
        "great_expectations.__version__": "0.5.1"
      },
      "expectations": [
        {
          "expectation_type": "expect_file_line_regex_match_count_to_be_between",
          "kwargs": {
            "regex": "\\|",
            "expected_min_count": 268,
            "expected_max_count": 268
          }
        }
      ]
    }
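The bounds in that config follow from the header: a line with n columns contains n - 1 delimiters, so 269 columns means 268 pipes. As a rough sketch, assuming the first line of the file is a trustworthy header with no quoted delimiters, the kwargs could be derived rather than hard-coded. `build_delimiter_count_expectation` is a hypothetical helper, not part of Great Expectations, and the header here is synthetic.

```python
import json
import re


def build_delimiter_count_expectation(header_line, delimiter="|"):
    """Build an expectation entry whose bounds equal the header's delimiter count."""
    expected = header_line.rstrip("\n").count(delimiter)
    return {
        "expectation_type": "expect_file_line_regex_match_count_to_be_between",
        "kwargs": {
            # re.escape is needed because "|" is a regex metacharacter.
            "regex": re.escape(delimiter),
            "expected_min_count": expected,
            "expected_max_count": expected,
        },
    }


# A synthetic 269-column header standing in for the real one.
header = "|".join(f"col{i}" for i in range(269)) + "\n"
expectation = build_delimiter_count_expectation(header)
print(json.dumps(expectation, indent=2))
```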
Now, the rows with “extra values” will be immediately flagged and available for inspection and quick refinement. Pivoting to a toy dataset, observe the following output:
    > asset.expect_file_line_regex_match_count_to_be_between(r'\|', 3, 3,
          result_format="SUMMARY")
    > {'success': False,
       'result': {'element_count': 12,
        'missing_count': 2,
        'missing_percent': 0.16666666666666666,
        'unexpected_count': 2,
        'unexpected_percent': 0.16666666666666666,
        'unexpected_percent_nonmissing': 0.2,
        'partial_unexpected_list': ['B|"C|"|5|4\n', 'B|"why| C"|1|4\n'],
        'partial_unexpected_index_list': [4, 5],
        'partial_unexpected_counts': [{'value': 'B|"C|"|5|4\n', 'count': 1},
         {'value': 'B|"why| C"|1|4\n', 'count': 1}]}}
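For intuition about what a line-level check of this kind does, here is a minimal sketch using only Python's `re` module. It is not how Great Expectations implements the expectation; `lines_outside_match_count` is a hypothetical helper, and the first two toy lines are made up to sit alongside the two flagged lines from the output.

```python
import re


def lines_outside_match_count(lines, regex, min_count, max_count):
    """Return (index, line) pairs whose regex match count falls outside the bounds."""
    pattern = re.compile(regex)
    unexpected = []
    for index, line in enumerate(lines):
        count = sum(1 for _ in pattern.finditer(line))
        if not min_count <= count <= max_count:
            unexpected.append((index, line))
    return unexpected


# Two well-formed lines (invented) plus the two flagged lines from the toy output.
toy_lines = [
    "A|B|C|D\n",
    "B|x|1|4\n",
    'B|"C|"|5|4\n',
    'B|"why| C"|1|4\n',
]

# Each flagged line contains four raw pipes, one more than the three expected.
flagged = lines_outside_match_count(toy_lines, r"\|", 3, 3)
```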
This provides immediate feedback we can use to refine our pipeline — in this case perhaps by adjusting our treatment of quoted characters — and simultaneously update our expectation:
    > asset.expect_file_line_regex_match_count_to_equal(r'\|(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)', 3, meta={"notes": "We observed quoted pipes in about 15 percent of the first test dataset; for example ‘B|\”C|\”|5|4’"})
    {
      "data_asset_name": "daily_delivery",
      "data_asset_type": "FileDataAsset",
      "meta": {
        "great_expectations.__version__": "0.5.0"
      },
      "expectations": [
        {
          "expectation_type": "expect_file_line_regex_match_count_to_be_between",
          "kwargs": {
            "regex": "\\|",
            "expected_min_count": 3,
            "expected_max_count": 3
          }
        },
        {
          "expectation_type": "expect_file_line_regex_match_count_to_equal",
          "kwargs": {
            "regex": "\\|(?=([^\"\\\\]*(\\\\.|\"([^\"\\\\]*\\\\.)*[^\"\\\\]*\"))*[^\"]*$)",
            "expected_count": 3
          },
          "meta": {
            "notes": "We observed quoted pipes in about 15 percent of the first test dataset; for example \u2018B|\\\u201dC|\\\u201d|5|4\u2019"
          }
        }
      ]
    }
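A quick way to gain confidence in a regex like this is to run it against the lines that failed earlier. The sketch below uses only Python's `re` module, not Great Expectations, and assumes the pattern is the usual quote-aware lookahead: a pipe counts only if the rest of the line contains balanced quotes. `count_unquoted_pipes` is a hypothetical helper.

```python
import re

# Matches a pipe only when everything after it, through end of line, consists of
# complete quoted blocks or escaped characters followed by quote-free text.
# A pipe sitting inside an open quoted block therefore does not match.
UNQUOTED_PIPE = re.compile(r'\|(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)')


def count_unquoted_pipes(line):
    """Count pipe delimiters that fall outside quoted blocks."""
    return sum(1 for _ in UNQUOTED_PIPE.finditer(line.rstrip("\n")))


samples = ['B|"C|"|5|4\n', 'B|"why| C"|1|4\n', "A|B|C|D\n"]
counts = {line: count_unquoted_pipes(line) for line in samples}
```

Both previously flagged lines hold four raw pipes but only three outside quotes, so they satisfy the refined expectation while still failing the naive one.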
Of course, that’s just the beginning (we just wrote a pretty ugly regex that already threatens to keep me up at night**). Keeping with the spirit of GE, we don’t expect that those expectations will cover all your needs, and we don’t expect that all quality issues will hit on your first round of expectations — but with checks like these in place you can more quickly zero in on problems and help convert tacit knowledge about datasets and data flows into explicit, testable statements.
Further, the FileDataAsset works like all the other core GE classes, making it easy to extend the logic with custom expectations that can still be easily understood, evaluated, and managed alongside other expectations relevant for your core business. For example, much like with the Dataset types, a FileDataAsset provides line-by-line parsing of text files, opening the map_expectation semantics to a wider range of logical locations in your data processing pipelines.
May your expectations always be fulfilled,
The Great Expectations Team
* We could also have omitted the parameter and created a FileDataAsset in the usual way, with a path parameter.
** For those of you who, like me, hate parsing regexes yourself: we’re looking for | characters that are not in blocks surrounded by quotation marks (and handling escapes). If you decide to include such a regex in your expectations, I’d strongly encourage you to add a nice comment to your expectation’s meta element! And please note that I gratefully borrowed that regex from this Stack Overflow question.
*** Please feel free to ask questions or interact with us through Slack and/or GitHub issues.
This post was originally published on Medium.