
Community roundup: November 2022

November’s community meetup included monitoring data lake health at scale, an Airflow operator update, a new team member welcome, & more!

Erin Kapp
November 17, 2022

We get the community together on the third Tuesday of every month. 

At meetups, we discuss the Great Expectations roadmap, watch ecosystem integration demos, and explore different ways data leaders have implemented Great Expectations. Sign up here to join the next one! 

Now, let's dive into the roundup.

We covered

  • A Great Expectations product roadmap update

  • Welcoming GX’s new Director of Developer Relations, Josh Zheng

  • GX Airflow operator improvements

  • Monitoring data lake health at scale, from Davide Romano

Roadmap update

Recently: Data Assistants launched on September 20

  • The first Data Assistant is the Onboarding Assistant. Data Assistants replace the UserConfigurable Profiler. They’re new, so be sure to give us your feedback!

Current focus: Making it easier to get started

  • In terms of features, we’re working on:

    • Improving DataContexts with better docstrings and more consistent save behavior.

    • A new approach to data sources. Watch an overview in last month’s update.

  • We’re also improving the developer experience:

    • We have new goals for contributor PRs: we aim to respond within 24 hours and merge within a week (depending on scale). An important note on this: if we close your PR due to inactivity, don’t hesitate to reopen it when you’re next able to continue the conversation!

    • We’re also focusing on creating API documentation that’s on par with other API docs that we love, like Pandas’ and SciPy’s.

Up next: SDKs

  • Improved API docs, automated tests and test environments, a well-defined publishing experience, and more.

The roadmap presentation is available as a video recording.

Welcome

We are thrilled to welcome Josh Zheng, GX’s new Director of Developer Relations! 

Josh is excited to meet community members—don’t hesitate to DM him @Josh Zheng on the GX Slack, and if you’re in the Bay Area, he’d love to buy you a coffee and talk GX in person.

Feature update

Exciting news about the Airflow operator: we’ve officially turned ownership of GX’s Airflow operator repository over to the Astronomer team, who will maintain it long-term alongside their other operators.

An announcement about a new version is forthcoming. The updates will make the operator easier for new GX users to get started with and will improve support for OpenLineage.

Check out the video for additional comments from Benji Lampel, Airflow Engineering Advocate at Astronomer.

We appreciate the Astronomer team’s commitment to ensuring a best-in-class integration with GX!

Community demo

This month we had a great presentation from Davide Romano, a data scientist at Mediaset. 

The presentation is available as a video recording. Read on for a summary!

Background

Mediaset is a leading private TV publisher in Italy and Spain that built a data lake receiving 100 GB of incoming data daily. As data volumes and daily queries increased, Mediaset moved to actively ensuring the health of its data lake, using Great Expectations and Spark to build a data quality framework.

Needs

Mediaset had four critical data quality needs:

  • App event tracking verification

  • ETL pipeline unit tests extension

  • Pipeline and services monitoring

  • Data drift detection
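To make the last need concrete, here is a minimal sketch of one simple way to flag drift: compare the mean of a current batch against a baseline, measured in baseline standard deviations. This is an illustration only, not Mediaset's method and not a Great Expectations API. The function names, the sample numbers, and the threshold of 3.0 are all assumptions.

```python
from statistics import mean, stdev


def mean_drift_score(baseline, current):
    """Return how many baseline standard deviations the current mean has moved."""
    base_sd = stdev(baseline)
    if base_sd == 0:
        raise ValueError("baseline has zero variance; drift score is undefined")
    return abs(mean(current) - mean(baseline)) / base_sd


def has_drifted(baseline, current, threshold=3.0):
    """Flag drift when the score exceeds the (assumed) threshold."""
    return mean_drift_score(baseline, current) > threshold


# Hypothetical daily metric values, invented for this example.
baseline_sessions = [10.0, 12.0, 11.0, 9.0, 13.0]
todays_sessions = [30.0, 28.0, 33.0, 29.0, 31.0]

if has_drifted(baseline_sessions, todays_sessions):
    print("drift detected: investigate the upstream pipeline")
```

A real framework would likely compare full distributions rather than means, but the shape is the same: a stored baseline, a fresh batch, and a threshold that turns a score into an alert.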

Workflow

Mediaset has a five-step workflow:

  • A Docker image running a Jupyter notebook and GX is used to develop Expectation Suites, including Custom Expectations. Custom Expectations are key to Mediaset’s GX implementation.

  • Once complete, Expectation Suites are committed to Mediaset’s git repository via a CI pipeline. The Expectation Suites, Custom Expectations, and Data Docs are stored in S3.

  • Airflow executes the data quality DAG.

  • Apache Superset is used to visualize results and send alerts.

  • GX’s Data Docs are published in Mediaset’s data documentation. This is key for sharing table checks with Mediaset’s nontechnical team members.
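As a rough picture of what an Expectation Suite expresses, the sketch below evaluates a small suite-like dictionary against a few rows using plain Python. It is a toy stand-in, not the Great Expectations library and not Mediaset's code. The two expectation names match built-in GX Expectations, but the evaluator, the suite name, the column names, and the sample rows are assumptions made for illustration.

```python
# Toy evaluator: this does NOT use or reproduce the Great Expectations library.

# A suite-like structure; the suite name and columns are hypothetical.
suite = {
    "expectation_suite_name": "app_events_example",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "user_id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "duration_s", "min_value": 0, "max_value": 86400},
        },
    ],
}


def check_not_null(rows, column):
    """Succeed only if no row has a missing value in the column."""
    return all(row.get(column) is not None for row in rows)


def check_between(rows, column, min_value, max_value):
    """Succeed if every non-null value lies in [min_value, max_value]."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return all(min_value <= value <= max_value for value in values)


# Map expectation names to the toy check functions above.
CHECKS = {
    "expect_column_values_to_not_be_null": check_not_null,
    "expect_column_values_to_be_between": check_between,
}


def validate(suite, rows):
    """Run each expectation in the suite and collect pass/fail results."""
    results = []
    for expectation in suite["expectations"]:
        name = expectation["expectation_type"]
        if name not in CHECKS:
            raise ValueError(f"no toy check implemented for {name}")
        success = CHECKS[name](rows, **expectation["kwargs"])
        results.append({"expectation_type": name, "success": success})
    return results


# Invented sample rows: the second row is missing its user_id.
rows = [
    {"user_id": "a1", "duration_s": 120},
    {"user_id": None, "duration_s": 300},
]

results = validate(suite, rows)
for result in results:
    print(result["expectation_type"], "passed" if result["success"] else "failed")
```

In a workflow like the one described, the suite would be developed in a notebook, stored, and then run on a schedule, with the pass/fail results feeding dashboards and alerts.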

Mediaset’s data quality evaluation process

Davide explained how Mediaset incorporates insight from its nontechnical domain experts: a feedback loop between domain experts and GX experts fully defines the data quality requirements before GX is used.

Davide also answered a few questions from the audience in the bonus/Q&A portion of the talk.

And the conversation continued in the GX Slack.

Thank you, Davide, for sharing your experience!

Read Davide’s article on Medium, and see the hands-on repo on GitHub.

Additional updates

Have you done something cool with Great Expectations that you'd like to share? If you're interested in demoing or have a piece of data quality content that you'd like us to feature, DM @Kyle Eaton on our Slack.

