We get the community together on the third Tuesday of every month.
At meetups, we discuss the Great Expectations roadmap, watch ecosystem integration demos, and explore different ways data leaders have implemented Great Expectations. Sign up here to join the next one!
Now, let's dive into the roundup.
We covered:
A Great Expectations product roadmap update
Welcoming GX’s new Director of Developer Relations, Josh Zheng
GX Airflow operator improvements
Monitoring data lake health at scale, from Davide Romano
Roadmap update
Recently: Data Assistants launched on September 20
The first Data Assistant is the Onboarding Assistant. Data Assistants replace the UserConfigurableProfiler. They’re new, so be sure to give us your feedback!
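If you’d like to try it, here’s a minimal sketch of running the Onboarding Assistant with the v0.15.x API; the datasource, data connector, asset, and suite names below are placeholders, not values from your project.

```python
# Minimal sketch: generate an Expectation Suite with the Onboarding
# Data Assistant (GX 0.15.x). Datasource/asset/suite names are placeholders.
import great_expectations as gx
from great_expectations.core.batch import BatchRequest

context = gx.get_context()

batch_request = BatchRequest(
    datasource_name="my_datasource",          # placeholder
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="my_table",               # placeholder
)

# The Assistant profiles the batch(es) and proposes Expectations,
# taking the place of the old UserConfigurableProfiler workflow.
result = context.assistants.onboarding.run(batch_request=batch_request)

suite = result.get_expectation_suite(
    expectation_suite_name="my_onboarding_suite"
)
context.save_expectation_suite(expectation_suite=suite)
```

From there, you can review and prune the generated Expectations before using the suite in a Checkpoint.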
Current focus: Making it easier to get started
In terms of features, we’re working on:
Improving DataContexts with better docstrings and more consistent save behavior.
A new approach to data sources. Watch an overview in last month’s update.
We’re also improving the developer experience:
We have new goals for responding to contributor PRs: we aim to respond within 24 hours and merge within a week, depending on the scale of the change. An important note on this: if we close out your PR due to inactivity, don’t hesitate to reopen it when you’re next able to continue the conversation!
We’re also focusing on creating API documentation that’s on par with other API docs that we love, like Pandas’ and SciPy’s.
Up next: SDKs
Improved API docs, automated tests and test environments, a well-defined publishing experience, and more.
Watch the roadmap presentation:
Welcome
We are thrilled to welcome Josh Zheng, GX’s new Director of Developer Relations!
Josh is excited to meet community members—don’t hesitate to DM him @Josh Zheng on the GX Slack, and if you’re in the Bay Area, he’d love to buy you a coffee and talk GX in person.
Feature update
Exciting news about the Airflow operator: working with the Astronomer team, we’ve officially transferred ownership of GX’s Airflow operator repository to Astronomer, who will maintain it long-term alongside their other operators.
An announcement about a new version is forthcoming; the updates will make the operator easier for new GX users to adopt and add improved support for OpenLineage.
Check out the video for additional comments from Benji Lampel, Airflow Engineering Advocate at Astronomer.
We appreciate the Astronomer team’s commitment to ensuring a best-in-class integration with GX!
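For context on what the operator does today, here’s a minimal sketch of a validation task in a DAG, assuming the airflow-provider-great-expectations package; the DAG id, paths, and Checkpoint name are illustrative.

```python
# Minimal sketch: run a GX Checkpoint from Airflow using the
# GreatExpectationsOperator. All names and paths here are illustrative.
from datetime import datetime

from airflow import DAG
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

with DAG(
    dag_id="gx_data_quality",
    start_date=datetime(2022, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    validate_orders = GreatExpectationsOperator(
        task_id="validate_orders",
        data_context_root_dir="/opt/airflow/great_expectations",
        checkpoint_name="orders_checkpoint",  # hypothetical Checkpoint
    )
```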
Community demo
This month we had a great presentation from Davide Romano, a data scientist at Mediaset.
Watch the presentation here:
Or read on for a summary!
Background
Mediaset is a leading private TV publisher in Italy and Spain that built a data lake ingesting 100GB of data daily. As data volumes and daily queries grew, Mediaset moved to actively ensuring the health of its data lake, using Great Expectations and Spark to build a data quality framework.
Needs
Mediaset had four critical data quality needs:
Verifying app event tracking
Extending ETL pipeline unit tests
Monitoring pipelines and services
Detecting data drift
Workflow
Mediaset has a five-step workflow:
Expectation Suites, including Custom Expectations, are developed in a Docker image running a Jupyter notebook and GX. Custom Expectations are key to Mediaset’s GX implementation (see the sketch after this list).
Once complete, Expectation Suites are committed to Mediaset’s git repository, and a CI pipeline stores the Expectation Suites, Custom Expectations, and Data Docs in S3.
Airflow executes the data quality DAG.
Apache Superset is used to visualize results and send alerts.
GX’s Data Docs are published in Mediaset’s data documentation. This is key for sharing table checks with Mediaset’s nontechnical team members.
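To give a flavor of step 1, here’s a minimal sketch of a Custom Expectation following GX’s 0.15.x ColumnMapExpectation pattern; the event-id rule and all class and column names are hypothetical, not taken from Davide’s talk.

```python
# Hypothetical Custom Expectation in the GX 0.15.x style: a metric
# provider defines the row-level condition, and an Expectation wires
# it up by name. The "valid event id" rule is purely illustrative.
from great_expectations.execution_engine import PandasExecutionEngine
from great_expectations.expectations.expectation import ColumnMapExpectation
from great_expectations.expectations.metrics import (
    ColumnMapMetricProvider,
    column_condition_partial,
)


class ColumnValuesMatchEventIdFormat(ColumnMapMetricProvider):
    # Metric name the Expectation below refers to.
    condition_metric_name = "column_values.match_event_id_format"

    @column_condition_partial(engine=PandasExecutionEngine)
    def _pandas(cls, column, **kwargs):
        # A value passes if it looks like EVT- followed by 8 digits.
        return column.astype(str).str.match(r"^EVT-\d{8}$")


class ExpectColumnValuesToMatchEventIdFormat(ColumnMapExpectation):
    """Expect column entries to be well-formed event ids (EVT-########)."""

    map_metric = "column_values.match_event_id_format"
```

Mediaset’s framework runs on Spark, so their real implementations would target the Spark execution engine as well; the Pandas engine above just keeps the sketch short.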
Mediaset’s data quality evaluation process
Davide explained how Mediaset incorporates insight from its nontechnical domain experts: a feedback loop between domain experts and GX experts fully defines the data quality requirements before GX is put to work.
Davide also answered a few questions from the audience. Click here to skip to the bonus/Q&A portion of the talk:
And the conversation continued in the GX Slack.
Thank you Davide for sharing your experience!
Read Davide’s article on Medium, and see the hands-on repo on GitHub.
Join the data quality conversation
Davor Korman had a question about keeping Data Docs from being overwritten on S3.
Gabriel Ferreira shared his chat with Martin Sahlen about why data lineage is hard.
Heather looked for advice about using batch_filter_params with split_on_year_and_month_and_day.
Brendan Bull asked about improving cluster utilization during concurrent Validations on Databricks: weigh in on GitHub!
Antonio Calvacante wrote a blog post about using GX to validate Delta Lake tables.
Additional updates
Get the invite for December’s meetup here.
GX now has CLI support for Trino (thanks, hovaesco!), plus some additional bug fixes and doc improvements in the latest release, v0.15.32.
Check out our tips on using Boto3 to assume a role with AWS, passing Spark configuration parameters, and specifying nonstandard delimiters for your CSV files.
We’re hiring in product and engineering! Check out our open roles.
Have you done something cool with Great Expectations that you'd like to share? If you're interested in demoing or have a piece of data quality content that you'd like us to feature, DM @Kyle Eaton on our Slack.