Community contributor interview: Steven Secreti

Learn more about the contributors & community members who help make Great Expectations what it is. Featuring: Steven Secreti.

Kyle Eaton

October 26, 2022

Great Expectations (GX) wouldn’t be where it is today without the many talented people who contribute to the open source project and participate in our Slack community. We appreciate each and every one of our 8,000 members (and growing)!

Today, we’re profiling Steven Secreti.

What is your current role and organization?

Currently, I am a graduate student pursuing my M.S. in Computer Science at Stony Brook University. When I made my contributions to Great Expectations, however, I was an intern in the Technology Intern Program at Capital One.

🍕 🍦 What is your favorite pizza topping and flavor of ice cream?

I usually like a regular slice, and if anything I would add extra sauce. Although in terms of a specialty slice I’d have to go with buffalo chicken.

Definitely coffee-flavored ice cream.

🔎 How did you discover Great Expectations? What were your first impressions?

I was first introduced to Great Expectations by my manager, Taylor Turner, when I started on the Data & Privacy team in the Center for Machine Learning at Capital One.

The Data & Privacy team developed Data Profiler, an open source solution that uses machine learning to help companies monitor big data and detect private customer information so that it can be protected.

Taylor Turner and Jeremy Goodsitt previously created the “Capital One Data Profiler Expectations” package within Great Expectations to expand upon the functionality of Data Profiler, as well as to contribute back to the community.

Taylor tasked me with expanding upon the Expectations within the package as a part of my work as an intern this summer [of 2022].

After familiarizing myself with Data Profiler and Great Expectations, I was very excited to get to work. I instantly saw the potential optimizations that both packages could provide to the community when applied correctly under the proper conditions.

I was and still am convinced that the functionality provided by Data Profiler, coupled with the foundational protocol established by Great Expectations, will be adopted by several individuals and teams who seek to make their machine learning tasks more efficient.

🌱 What are some things you find rewarding about contributing to an open source project? Is there anything rewarding about contributing to the Great Expectations community specifically?

I think the predominant component that makes contributing to an open source project so rewarding is the fact that the work you do has no ceiling on the amount of value it can provide.

I think most software developers would agree that a lot of the code they write can be repurposed for several different applications or utilized in many different use cases. The beauty of an open source project lies in the free access to functionality that it provides.

I am glad to know that the work I contributed to Great Expectations will not only be used by the Data & Privacy team at Capital One but will also be used by others in the future to their benefit.

I think a rewarding component of contributing to the Great Expectations community was the actual community members themselves.

Over the course of weeks, I saw countless members of the community with interesting ideas, requests, or insights throughout the Great Expectations Slack channels.

Being part of that community and seeing some of the members’ perspectives first-hand helped me on a few different occasions, either in solving a problem I was stuck on or in coming up with a new idea for how we could enhance both Data Profiler and our contributions to Great Expectations.

🛠️ What do your Expectations do, and what are some reasons an organization or a data team would want to use them?

The main group of Expectations I contributed to the Capital One Data Profiler Expectations package was the expect_profile_numeric_columns_diff_… Expectations.

At a high level, the idea behind these Expectations is that when called on a dataset, they will compare the new dataset’s profile to a snapshot of the same dataset from an earlier point in time.

The difference would be taken between the old profile and the new profile, generated from the dataset on which this Expectation was called. The result of the difference operation between the numerical columns of the two profiles would be checked against a dictionary of thresholds provided by the user to determine the success of the Expectation.

The purpose of this set of Expectations is to provide a basis for automated data quality checks in any dataset a user may desire to monitor.

A good example of this is in the case of a new company with a dataset of their transactions throughout each month and a column containing the dollar value of each transaction.

One of the expect_profile_numeric_columns_diff_… Expectations, run with the proper parameters, will check that the company’s revenue is increasing month-to-month as planned and alert the company if this Expectation is not met.
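To make the idea concrete, here is a minimal sketch in plain Python of the comparison described above: take the difference between an old and a new profile, then check it against user-supplied thresholds. It does not use the Data Profiler or Great Expectations APIs. The profile layout, the threshold format, and the names `numeric_profile_diff` and `check_diff_thresholds` are all assumptions made for illustration, not the package's actual interface.

```python
from typing import Dict, Optional, Tuple

# Hypothetical shapes, assumed for this sketch only.
Profile = Dict[str, Dict[str, float]]             # {column: {statistic: value}}
Bounds = Tuple[Optional[float], Optional[float]]  # (lower, upper); None means unbounded
Thresholds = Dict[str, Dict[str, Bounds]]         # {column: {statistic: bounds}}


def numeric_profile_diff(old: Profile, new: Profile) -> Profile:
    """Return new - old for every column and statistic present in both profiles."""
    diff: Profile = {}
    for column in old.keys() & new.keys():
        shared_stats = old[column].keys() & new[column].keys()
        diff[column] = {s: new[column][s] - old[column][s] for s in shared_stats}
    return diff


def check_diff_thresholds(
    diff: Profile, thresholds: Thresholds
) -> Tuple[bool, Dict[Tuple[str, str], float]]:
    """Check each thresholded difference; return (success, out-of-bounds values)."""
    failures: Dict[Tuple[str, str], float] = {}
    for column, stat_bounds in thresholds.items():
        for stat, (lower, upper) in stat_bounds.items():
            value = diff[column][stat]
            too_low = lower is not None and value < lower
            too_high = upper is not None and value > upper
            if too_low or too_high:
                failures[(column, stat)] = value
    return not failures, failures


# Made-up monthly snapshots of a transactions dataset.
last_month: Profile = {"transaction_amount": {"sum": 10000.0, "mean": 50.0}}
this_month: Profile = {"transaction_amount": {"sum": 12500.0, "mean": 52.0}}

# Require that total transaction value does not decrease month-to-month.
thresholds: Thresholds = {"transaction_amount": {"sum": (0.0, None)}}

success, failures = check_diff_thresholds(
    numeric_profile_diff(last_month, this_month), thresholds
)
print(success, failures)
```

In this sketch a lower bound of 0.0 on the difference in `sum` plays the role of the month-to-month revenue check: the check passes when the total grows and reports the offending column and statistic when it shrinks.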

In the future, the team hopes to expand upon this concept to automatically detect data drift.

📣 Are there any other open source projects you’d like to shout out?

I’d particularly like to shout out Capital One’s Data Profiler. I had the privilege of working with the team behind Data Profiler for 10 weeks and I think it is an incredible project overall.

It is clear that the individuals working on Data Profiler take immense pride in their work and aim to provide an excellent product. I am positive that machine learning engineers and teams across many different industries will adopt Data Profiler in their day-to-day workflows as it gains exposure in the coming years.

✅ What does data quality mean to you?

In my opinion, there are two components that make up data quality.

The first is the basic component of the actual information itself. Are there several missing values? Was the methodology in gathering this data consistent? These questions reflect upon a specific dataset itself.

The second component of data quality relates to the implications of the data. Mainly, what is the effect of using this data to create my machine learning model?

I think it is important to consider here that this question does not only apply to a specific dataset, but also to the data that a machine learning model will make predictions on.

Data that is perpetually updated is important to identify, acknowledge, and account for in building any machine learning model. Here, data quality is a relative term that considers the way in which data evolves over time and under different circumstances.

Thanks to Steven for taking the time to speak with us!

If you’re thinking about joining the GX community, there’s no time like the present. Ease in by lurking in Slack or go straight to sharing your Custom Expectations: we’re happy to have you no matter how you want to engage.

Get involved:

You can join the GX Slack here.

Check out our guide for getting started with contributing.

It’s easy to share a Custom Expectation if you follow our step-by-step process.

To contribute a package, start with this how-to, so everything is as easy as possible.
