
Does GX work on streaming data?
This is a question we get a lot. Usually, it’s because people are wondering whether GX can do anything meaningful with new, incoming data that arrives frequently and in relatively small amounts.
The short answer to that question is: yes.
The medium answer is: yes. More importantly, though, GX makes it possible to do historical batch-based validation that picks up deeper types of insights (even from streaming data).
The long answer is: this blog.
Validation on streaming data
Setting the stage
Often when people talk about ‘validating streaming data’ they mean examining the data on a record-by-record basis as soon as it arrives.
This is sometimes called 'schema data validation,' but often it's just called 'data validation.' That shorthand does data engineers a disservice: it implies an artificial narrowing of data validation's scope, when it should be a wide-ranging practice.
On-arrival individual validation can detect overt problems in single records, like an unexpectedly empty field or a value outside of an expected range. However, there are many aspects of data quality that it doesn't address.
But let's set that discussion aside for a moment and look at strategies for the on-arrival aspect of data validation, a.k.a. schema data validation.
You can do on-arrival validation of streaming data truly one record at a time. But in many scenarios, writing each event individually is cost-prohibitive, so even records in your 'real-time' pipelines end up microbatched: events are buffered (to S3, for example) and then written to the warehouse together.
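For the truly one-record-at-a-time path, the check usually looks something like the sketch below. This is a generic illustration using the jsonschema library rather than GX, and the event fields are hypothetical stand-ins for whatever your stream actually carries:

```python
# Minimal record-at-a-time schema check (generic, not GX).
# The event fields below are hypothetical placeholders.
from jsonschema import ValidationError, validate

EVENT_SCHEMA = {
    "type": "object",
    "required": ["event_id", "timestamp", "amount"],
    "properties": {
        "event_id": {"type": "string"},
        "timestamp": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
    },
}


def record_is_valid(record: dict) -> bool:
    """Return True if a single incoming record conforms to the schema."""
    try:
        validate(instance=record, schema=EVENT_SCHEMA)
        return True
    except ValidationError:
        return False


print(record_is_valid({"event_id": "a1", "timestamp": "2023-06-01T10:01:00Z", "amount": 12.5}))
```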
If you're using the microbatch approach for your streaming data, GX can handle the basic on-arrival validation in addition to your more comprehensive data quality checks. If your pipeline really does apply those first checks to each record individually, you should choose another tool for the on-arrival validation.
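When a microbatch lands as a table, those same basic checks can be expressed in GX. Here's a minimal sketch, assuming the legacy gx.from_pandas entry point (newer GX releases set up a context and validator instead) and hypothetical column names:

```python
import great_expectations as gx
import pandas as pd

# Hypothetical microbatch: events buffered (e.g., in S3), then loaded together
microbatch = pd.DataFrame(
    {
        "event_id": ["a1", "a2", "a3"],
        "amount": [12.5, 7.0, 19.99],
        "status": ["ok", "ok", "retry"],
    }
)

batch = gx.from_pandas(microbatch)

# Schema-style, per-record checks applied to the whole microbatch at once
batch.expect_column_values_to_not_be_null("event_id")
batch.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)
batch.expect_column_values_to_be_in_set("status", ["ok", "retry", "failed"])

# Overall pass/fail for the microbatch
print(batch.validate().success)
```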
Either way, it's vital that your data validation doesn't end with the on-arrival checks. It's in the post-arrival validation that GX really shines.
Here’s what it is and why you need it.
Beyond the schema
Since schema-based validation looks at each record individually—isolated from all the other records—it can’t catch issues like:
A record fails to arrive.
The data arrives at a rate that’s different from what you expected.
The rate of the data’s arrival is changing (or not changing) differently from what you expected.
The rate at which data fails the schema-based validation is different from what you expected.
Schema validation of individual records also can’t catch more subtle problems, such as:
The distribution of the data within the permitted range is different from what you expected.
Data arrives out of order.
Data arrives earlier or later than you expected.
Data becomes more similar, or more different, over time than you expected.
You can only detect problems like these (and many others) by validating a record in the context of other records of the same kind. That requires data to accumulate to some degree, so in this blog, 'streaming data' refers to any data generated by a streaming source, even when the system where it accumulates isn't itself a streaming system.
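To make that concrete, here's a hedged sketch of a few context-dependent checks run against an accumulated batch. As before, it assumes the legacy gx.from_pandas entry point, and the columns and thresholds are hypothetical; timestamp-based checks for late or out-of-order arrival follow the same pattern.

```python
import great_expectations as gx
import pandas as pd

# Hypothetical example: an hour's worth of accumulated events
events = pd.DataFrame(
    {
        "event_id": ["a1", "a2", "a3", "a4"],
        "created_at": pd.to_datetime(
            ["2023-06-01 10:01", "2023-06-01 10:07", "2023-06-01 10:22", "2023-06-01 10:40"]
        ),
        "amount": [12.5, 7.0, 19.99, 4.25],
    }
)

batch = gx.from_pandas(events)

# Arrival volume: did roughly the expected number of records show up?
batch.expect_table_row_count_to_be_between(min_value=3, max_value=10_000)

# Duplicates: invisible one record at a time, obvious in a batch
batch.expect_column_values_to_be_unique("event_id")

# Distribution: values should cluster where they usually do, not just stay in range
batch.expect_column_mean_to_be_between("amount", min_value=5, max_value=50)

print(batch.validate().success)
```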
Data quality with context
The word ‘batch’ has a lot of negative baggage attached, but for data validation in GX, you can throw that baggage overboard.
Start thinking of batches as in cookies: flexible, scalable, made from a recipe you can tweak depending on exactly what you want. Sometimes turning out a bunch of small batches quickly is what you need, and sometimes you want one really big one. Making batches is a normal, routine thing you do on an ongoing basis.
A batch in GX just means identifying the subset of your records that you want to run a given set of tests on. We expect you to have a lot of batches: batches that serve different purposes, some very small and some very large.
The same record can (and probably will) end up in multiple batches, so we can't take the cookie analogy too far, but hopefully you get the idea. Batching in GX uses exactly the same nimble, flexible, iterative approach as our testing does.
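As a rough illustration of the idea (GX can also slice batches for you at the Data Source level, but plain pandas slicing is enough to show the concept), here's what running the same checks over one batch per day might look like, again assuming the legacy gx.from_pandas entry point and hypothetical columns:

```python
import great_expectations as gx
import pandas as pd


def validate_daily_batches(events: pd.DataFrame) -> dict:
    """Run the same checks on one batch per day; columns are hypothetical."""
    results = {}
    for day, daily_events in events.groupby(events["created_at"].dt.date):
        batch = gx.from_pandas(daily_events)
        batch.expect_table_row_count_to_be_between(min_value=1, max_value=100_000)
        batch.expect_column_values_to_not_be_null("event_id")
        results[str(day)] = batch.validate().success
    return results
```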
The context that batches provide is essential to checking for a huge range of data quality issues. Here’s the proof.
Consider the rocket launch
We’re going to use data from a public API about historical rocket launches: Launch List from The Space Devs. 🚀 Each API call retrieves the data for a single launch.
Looking at the record for the space shuttle Atlantis’ first launch, you can see that these are really complex records, with a lot of nesting and variable column counts:
...

"name": "Space Shuttle Atlantis / OV-104 | STS-51-J",
"status": {
    "id": 3,
    "name": "Launch Successful",
    "abbrev": "Success",
    "description": "The launch vehicle successfully inserted its payload(s) into the target orbit(s)."
},

...

"launch_service_provider": {
    "id": 192,
    "url": "https://ll.thespacedevs.com/2.2.0/agencies/192/",
    "name": "Lockheed Space Operations Company",
    "featured": false,
    "type": "Commercial",
    "country_code": "USA",
    "abbrev": "LSOC",

...

"launcher_stage": [],
"spacecraft_stage": {
    "id": 64,
    "url": "https://ll.thespacedevs.com/2.2.0/spacecraft/flight/64/",
    "mission_end": "1985-10-07T17:00:00Z",
    "destination": "Low Earth Orbit",
    "launch_crew": [
        {
            "id": 1241,
            "role": {
                "id": 1,
                "role": "Commander",
                "priority": 0
            },
            "astronaut": {
                "id": 327,
                "url": "https://ll.thespacedevs.com/2.2.0/astronaut/327/",
                "name": "Karol J. Bobko",
                "type": {
                    "id": 2,
                    "name": "Government"
                },
                "in_space": false,
                "time_in_space": "P16DT2H2M53S",
                "status": {
                    "id": 2,
                    "name": "Retired"

...

"spacecraft": {
    "id": 39,
    "url": "https://ll.thespacedevs.com/2.2.0/spacecraft/39/",
    "name": "Space Shuttle Atlantis",
    "serial_number": "OV-104",
    "is_placeholder": false,
    "in_space": false,
    "time_in_space": "P307DT11H44M20S",
    "time_docked": "P118DT7H2M9S",
    "flights_count": 33,
    "mission_ends_count": 33,
    "status": {
        "id": 2,
        "name": "Retired"
    },

...

"landing": {
    "id": 933,
    "attempt": true,
    "success": true,
    "description": "Space Shuttle Atlantis successfully landed at Edwards Air Force Base after this mission.",
    "downrange_distance": null,
    "location": {
        "id": 23,
        "name": "Edwards Air Force Base",

...
These records are a reasonable stand-in for streaming data (or streamed data, if you prefer that term once it has landed in another system) because they're event-based and originate via API call. Rocket launches don't happen nearly as often as, say, real-time consumer purchases, but the general scenario, and the types of issues that can occur, are very analogous.
To see our complete code, including the data retrieval and setup, you can follow along and/or download the ipynb file from this gist page.
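If you just want the gist of the retrieval step, it looks roughly like the sketch below. This is not the notebook's exact code: the endpoint is inferred from the URLs embedded in the record above, and the flattening step is one reasonable approach to the nesting.

```python
import pandas as pd
import requests

# Endpoint inferred from the URLs embedded in the record above
BASE_URL = "https://ll.thespacedevs.com/2.2.0/launch/"


def fetch_launch(launch_id: str) -> dict:
    """Retrieve the raw, heavily nested record for a single launch."""
    response = requests.get(f"{BASE_URL}{launch_id}/")
    response.raise_for_status()
    return response.json()


def flatten_launch(record: dict) -> pd.DataFrame:
    """Flatten nested fields (e.g., status.name, spacecraft_stage.spacecraft.name) into columns."""
    return pd.json_normalize(record, sep=".")
```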
Batching for contextualized data quality
Expect launch locations to be known
Let's start with an example of a batch validation that's similar in spirit to an as-it-arrives check, but different in implication.
from great_expectations.expectations.expectation import ColumnMapExpectation

# This class defines the Expectation itself
class ExpectLaunchLocationToBeKnown(ColumnMapExpectation):
    """Expect values in this column to be known launch locations as provided."""

    # The row-level logic lives in the metric this name points to
    map_metric = "launch_location.is_known"

    success_keys = ("column", "location_ids", "mostly",)