Bad data in your machine learning model is bad news, which is why data practitioners spend almost 40% of their time on data prep and cleansing. But that manual strategy doesn’t scale when an ML model needs a steady flow of high-quality data.
So how can data scientists make sure that their work isn’t undermined by undetected data quality problems?
One easy way: by integrating Great Expectations with ZenML.
Data quality with centralized MLOps
Machine learning pipelines typically require several standard steps.
First, the data needs to be annotated. Right off the bat, there can be complications: often the labeling data for training and testing is itself poorly formatted. And once annotated, the data has to be programmatically cleaned to remove nulls, misformatted cells, and similar issues.
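As a rough illustration of that cleaning step, here’s a minimal pandas sketch; the column names and the specific fixes are hypothetical, standing in for whatever your raw data actually needs.

```python
import pandas as pd

# Hypothetical raw training data: missing values and misformatted cells.
raw = pd.DataFrame({
    "review": ["great product", None, "  broke after a week "],
    "rating": ["5", "4", "one"],  # should be numeric
})

cleaned = (
    raw
    .dropna(subset=["review"])  # drop rows with missing text
    .assign(
        review=lambda df: df["review"].str.strip(),  # normalize whitespace
        rating=lambda df: pd.to_numeric(df["rating"], errors="coerce"),  # bad values -> NaN
    )
    .dropna(subset=["rating"])  # drop rows whose rating couldn't be parsed
)
```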
Data in hand, data scientists can’t just deploy a model and call it a day. They need to find the right model, tune it, and continuously monitor it. Finally, all that training, testing, and monitoring has to be orchestrated on a schedule.
If this all sounds like a lot to you, you’re not alone. That basic outline of MLOps has five separate steps: data annotation, cleaning, modeling, monitoring, and orchestration. In practice, MLOps has even more nuance—read more about MLOps here.
On top of that, all these steps need to run on malleable cloud infrastructure. And if data is being passed between the different services without any oversight, many things can silently go wrong.
ZenML is an MLOps framework that tackles this problem by unifying previously disparate steps. For data scientists, this means a better understanding of how data is being moved and used.
Using the ZenML framework opens the door to first-class data quality unit testing with Great Expectations.
Great Expectations within ML pipelines
Great Expectations allows data engineers and data scientists alike to write unit tests for their data at any step of a pipeline. Tests are expressed as Expectations; a failure during Validation of the Expectations means there’s an issue with the quality of the data.
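Here’s a minimal sketch of what that looks like, assuming the classic pandas-backed GX API (`ge.from_pandas`, available in pre-1.0 releases of Great Expectations) and a toy DataFrame:

```python
import great_expectations as ge
import pandas as pd

# Toy dataset with a data quality problem: a missing label.
df = pd.DataFrame({"text": ["great", "terrible", "fine"], "label": [1, 0, None]})

# Wrap the DataFrame so it exposes expect_* methods (classic pandas API, GX < 1.0).
ge_df = ge.from_pandas(df)

# An Expectation is a unit test for the data; running it produces a Validation result.
result = ge_df.expect_column_values_to_not_be_null("label")
print(result.success)  # False: the null label is flagged as a data quality issue
```

Because the result is a structured object, the same check works the same way in a notebook or inside a pipeline step.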
In centralized ML pipelines, a failed Validation can trigger notifications. Since models are constantly being retrained using new data, those alerts are a major improvement over a completely unmonitored pipeline.
Integrating GX with ZenML brings GX’s data quality testing into the MLOps framework itself. ZenML can trigger GX Validations at any point, allowing data scientists to place GX Validations alongside all their essential tools.
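As a sketch of what that can look like, here’s a validation step wired into a ZenML pipeline. It assumes ZenML’s @step and @pipeline decorators (ZenML 0.40+) and the classic pandas GX API; the step names, columns, and the single Expectation are illustrative, and ZenML’s documentation also covers built-in Great Expectations steps you can use instead.

```python
import great_expectations as ge
import pandas as pd
from zenml import pipeline, step


@step
def load_training_data() -> pd.DataFrame:
    # Placeholder loader; in practice this would pull from your warehouse or feature store.
    return pd.DataFrame({"feature": [0.1, 0.4, 0.9], "label": [0, 1, 1]})


@step
def validate_training_data(df: pd.DataFrame) -> pd.DataFrame:
    # Run GX Expectations against the incoming batch; fail the pipeline on bad data.
    ge_df = ge.from_pandas(df)
    result = ge_df.expect_column_values_to_not_be_null("label")
    if not result.success:
        raise ValueError("Training data failed validation: null labels found")
    return df


@step
def train_model(df: pd.DataFrame) -> None:
    ...  # model training goes here


@pipeline
def training_pipeline():
    df = load_training_data()
    validated = validate_training_data(df)
    train_model(validated)


if __name__ == "__main__":
    training_pipeline()
```

Failing the validation step when an Expectation fails keeps bad data from ever reaching the training step.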
To get started, set up your Great Expectations context with ZenML as outlined in ZenML's documentation. Use GX statistical profiles early in your model-building process to understand the training dataset and set your expectations (and your Expectations).
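If you’d rather not write every Expectation by hand, one option in pre-1.0 GX releases is the UserConfigurableProfiler, which builds a draft Expectation Suite from a sample of your data; the DataFrame here is a placeholder.

```python
import great_expectations as ge
import pandas as pd
from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

# Profile a sample of the training data to bootstrap an Expectation Suite.
train_df = pd.DataFrame({"feature": [0.1, 0.4, 0.9], "label": [0, 1, 1]})
ge_df = ge.from_pandas(train_df)

profiler = UserConfigurableProfiler(profile_dataset=ge_df)
suite = profiler.build_suite()  # generated Expectations you can review and edit

print(len(suite.expectations))
```

Treat the generated suite as a starting point: review it and tighten or drop Expectations to match what you actually expect from the data.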
As the model is retrained on new data, configure GX Validations to run on each incoming batch. This helps you ensure that new data won’t be bad data.
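One way to keep those Validations running is to put the whole pipeline on a schedule. This sketch assumes ZenML’s Schedule class, an orchestrator that supports cron schedules, and the training_pipeline sketched above; the cron expression is just an example.

```python
from zenml.config.schedule import Schedule

# Re-run the validation + retraining pipeline nightly at 02:00.
nightly_pipeline = training_pipeline.with_options(
    schedule=Schedule(cron_expression="0 2 * * *")
)
nightly_pipeline()
```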
Get started with Great Expectations and ZenML
Integrate your Great Expectations setup with ZenML using our Great Expectations + ZenML architecture documentation!
Or, to start fresh with Great Expectations, check out our GX Welcome page.