Great Expectations Case Study:
How Rent the Runway uses GX to make data more accessible and find errors faster
GX automates quality testing so RTR's data teams can easily bring new data sources into their warehouse
About Rent the Runway
Rent the Runway is transforming the way we get dressed by pioneering the first Closet in the Cloud. Founded in 2009, RTR is disrupting the $2.4 trillion fashion industry by inspiring women with a more joyful, sustainable, and financially-savvy way to feel their best every day.
As a leader in the circular economy for fashion, the brand offers multiple points of access to its shared closet via a customizable subscription to fashion, one-time rental or ownership. RTR offers designer apparel, accessories, and home decor from 700+ brand partners and has built in-house proprietary technology and a one-of-a-kind reverse logistics operation.
Under CEO and Co-Founder Jennifer Hyman’s leadership, RTR has been named to CNBC’s “Disruptor 50” five times in ten years, and has been placed on Fast Company’s Most Innovative Companies list multiple times, while Hyman herself has been named to the “TIME 100” most influential people in the world and as one of People magazine’s “Women Changing the World.”
The Data Engineering team at Rent the Runway supports data science and machine learning efforts across the organization. They’re responsible for ingesting data from various sources and applying the first layer of business logic to “clean” the data.
Most of the data the team works with comes from application backends at Rent the Runway, as well as internal APIs and third-party vendor APIs. Examples include inventory data from the company’s physical warehouses, information about individual items of clothing and their current status (e.g., whether they’re rented out by a customer, being dry cleaned, or “on the rack” awaiting rental), data about members and memberships, and some data about members’ preferences.
The team’s data stack consists of streams that are ingested in real time (but processed only periodically), a data transformation layer using dbt, and a data warehouse. In addition to those sources, the team also receives data as CSV files, such as data from a new marketing partner or information from stakeholders about a new process.
One of the main issues the team had been dealing with was loading those CSVs into the data warehouse. The upload job would often fail because an uploaded CSV did not match the required data model: for example, the file had different columns than expected.
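A column mismatch of this kind can be caught before the upload job runs. The following is a minimal sketch in plain Python, not Rent the Runway's tool or the Great Expectations API; the column names in `REQUIRED_COLUMNS` and the helper `check_columns` are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical data model: the columns an uploaded file is assumed to need.
REQUIRED_COLUMNS = {"partner_id", "campaign", "spend_usd"}


def check_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return (missing, unexpected) column names for a CSV's header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    found = {name.strip() for name in header}
    missing = sorted(required - found)
    unexpected = sorted(found - required)
    return missing, unexpected


# A file whose third column was named differently than the data model expects.
sample = "partner_id,campaign,cost\n42,spring_launch,1000\n"
missing, unexpected = check_columns(sample)
print("missing:", missing, "unexpected:", unexpected)
```

Reporting both the missing and the unexpected columns tells the uploader which header to rename, rather than only that the job failed.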
Even when the upload succeeded, the team would find incorrect data that required manual fixes, such as deleting the faulty rows and uploading new ones. The result was a long development loop of trial-and-error uploads and manual fixes.
While the team also uses dbt tests in their pipelines, they wouldn’t find out about issues with the uploaded CSV files until the dbt tests were run. This meant that a user would have to re-upload the file and re-run the affected dbt models as well as downstream models to determine whether the issue was fixed.
Eventually, the Data Engineering team came across Great Expectations while looking for solutions to these data upload problems.
Great Expectations provided a general-purpose framework that didn’t require a custom build for each stakeholder. In particular, the team pointed out that Great Expectations was generic enough to cover all the use cases they wanted to handle, yet could still be tailored to each one.
How Rent the Runway uses GX
The Data Engineering team currently runs Great Expectations as a backend for the CSV upload tool. One aspect worth pointing out is the split of responsibilities between the Data Engineering team and the downstream data teams: the Data Engineering team owns the Great Expectations infrastructure, while the downstream data teams create the relevant Expectation Suites themselves.
This meant that in the beginning, the team had to teach other users how to use Great Expectations and create some internal documentation about workflows, but the effort paid off. Teams who want to create Expectation Suites to validate future CSV uploads now run through the following workflow:
1. Launch a Docker container with Great Expectations.
2. Move the CSV file into the relevant folder.
3. Use the Great Expectations command line interface to launch a pre-populated notebook that provides profiler-generated Expectations.
4. Review and edit the generated Expectation Suite to fine-tune the Expectations.
5. Create a Checkpoint that validates a data asset with the created Expectation Suite.
Whenever a user uploads a new CSV file, the Checkpoint is run to validate the file, Data Docs are generated, and the output is sent to cloud data storage.
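The validate-on-upload flow can be illustrated with a small stand-in. This sketch deliberately uses plain Python rather than the Great Expectations API, whose calls differ between versions. `EXPECTATIONS`, `validate_csv`, and `handle_upload` are hypothetical names, and the sketch assumes a file is loaded only when validation passes.

```python
import csv
import io

# Hypothetical expectations: column name -> predicate each value must satisfy.
EXPECTATIONS = {
    "item_id": lambda v: v.isdigit(),
    "status": lambda v: v in {"rented", "dry_cleaning", "on_rack"},
}


def validate_csv(csv_text, expectations=EXPECTATIONS):
    """Check every row against the expectations and return a small report."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    failures = []
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for column, is_valid in expectations.items():
            value = row.get(column)
            if value is None or not is_valid(value):
                failures.append({"line": line_no, "column": column, "value": value})
    return {"success": not failures, "rows_checked": len(rows), "failures": failures}


def handle_upload(csv_text, load_to_warehouse):
    """Validate first; call the loader only when validation succeeds."""
    report = validate_csv(csv_text)
    if report["success"]:
        load_to_warehouse(csv_text)
    return report


loaded = []  # stands in for the data warehouse
good = "item_id,status\n101,rented\n102,on_rack\n"
bad = "item_id,status\n101,rented\nabc,lost\n"
good_report = handle_upload(good, loaded.append)
bad_report = handle_upload(bad, loaded.append)
print(good_report["success"], bad_report["success"], bad_report["failures"])
```

The report plays the role that the validation result and Data Docs play in the real setup: it records which line and column failed, so the uploader can fix the file instead of discovering the problem downstream.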
As an extension to the CSV upload, the engineering team added the ability to enter a Google Sheets URL into the uploader tool, which kicks off a validation run with Great Expectations and uploads the sheet to the data warehouse if validation succeeds.
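The source does not describe how the uploader fetches a sheet. One possible approach, sketched here purely as an assumption, is to turn the pasted URL into a CSV export URL so that the sheet can go through the same validation as any other CSV. The helper `sheet_csv_export_url` and the example sheet ID are hypothetical, and the sketch assumes the sheet is readable by the caller and that Google's `/export?format=csv` URL pattern applies.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumed URL layout: https://docs.google.com/spreadsheets/d/<sheet id>/...
SHEET_ID_PATTERN = re.compile(r"/spreadsheets/d/([a-zA-Z0-9_-]+)")


def sheet_csv_export_url(sheet_url):
    """Build a CSV export URL from a Google Sheets URL (assumed URL layout)."""
    match = SHEET_ID_PATTERN.search(sheet_url)
    if match is None:
        raise ValueError(f"not a Google Sheets URL: {sheet_url}")
    parsed = urlparse(sheet_url)
    # The tab id ("gid") can appear in the query string or in the fragment.
    params = parse_qs(parsed.query)
    params.update(parse_qs(parsed.fragment))
    gid = params.get("gid", ["0"])[0]  # default to the first tab
    return (
        f"https://docs.google.com/spreadsheets/d/{match.group(1)}"
        f"/export?format=csv&gid={gid}"
    )


url = "https://docs.google.com/spreadsheets/d/EXAMPLE_SHEET_ID/edit#gid=123"
export_url = sheet_csv_export_url(url)
print(export_url)
```

The sketch only rewrites the URL and performs no network request; downloading the CSV and handling access permissions are left out.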
In addition to the source data tests with Great Expectations, the team still uses dbt tests in their pipelines to test the transformations. “We try to test our base tables before they get to dbt, so we can focus on the model we’re building. The dbt tests are still valuable to test the business logic; we see them as complementary!” (Danielle Dalton, Data Engineer at Rent the Runway)
The main benefit the Data Engineering team has seen since implementing Great Expectations in their upload tool is that data has become more accessible: the CSV data sources would otherwise not be easy to add to the data warehouse.
“This allows our data teams to use the data from their CSV files in our data warehouse, spend more time on analysis, and focus on more important things than data ingestion.” (Danielle Dalton, Data Engineer at Rent the Runway) The team also reports that users particularly enjoy Data Docs, which let them see any data issues directly and serve as a “troubleshooting tool” when interacting with stakeholders.