Dispatches from a chaotic corner of data science

When data gets messy, this real estate data scientist knows what to do

Nick O'Brien
June 07, 2023
Get the inside story on the data behind real estate transactions (📸: Stephen VanHorn via Adobe Stock)

To the average person, buying real estate may seem like a relatively straightforward process of applying for loans and bidding on homes. But while you’re strolling through open houses wondering if there’s beautiful, glossy hardwood under that ugly bedroom carpet, a complex network of private and public entities is working behind the scenes to build and maintain vast troves of real estate data.

These databases log myriad details about properties of all kinds and exchange data constantly—and all of this plays out across thousands of jurisdictions spanning the whole country.

If this sounds like an invitation for things to get messy, you’re not wrong.

That’s why we sat down with Lucas Roy, a data scientist whose resume includes stints at companies like Zillow and Pacaso, to better understand what data professionals can do to maintain quality—whether in real estate or any other vast, complex, and fast-moving corner of the data science world.

Some background: where does real estate data come from?

Data scientists in real estate typically get their data from two sources: multiple listing services (MLS) and counties.

Multiple listing services

You might not expect a highly competitive and commission-based field like real estate to involve much collaboration. In reality, real estate agents rely heavily on MLSs, which allow agents to share and view information about listed properties—typically attributes like square footage, number of bedrooms and bathrooms, or architectural style—and use that information to help each other find buyers and close sales. 

An MLS can operate at the state or local level and, despite existing as a tool for cooperation, can be highly exclusive. “You have to be a realtor to access MLS data, and even then, you have to submit an application,” Lucas tells us. “It’s not uncommon to see an MLS carve out territory and be really protective of it—almost like a gang.”

County data

The other main data source in real estate is county-level records. These are divided into two categories:

  • Tax assessor datasets help governments assign a value to a home to calculate owners’ property tax liabilities.

  • Transactional records illuminate the deeded history of the house—who owns it now, who owned it previously, and when it was last sold.

Where real estate data gets challenging

As any real estate data scientist will tell you, their job is no walk in the park. With myriad repositories hosting enormous volumes of ever-changing data, things get messy quickly. 

Here are some of the biggest challenges that data scientists like Lucas face at work.

Challenges with data quality

The intense protectiveness around MLS data we alluded to earlier can create major headaches for data scientists. Zillow, for example, is locked out of many MLS feeds, which limits coverage and, as a result, harms data quality. 

Another blow to quality often comes after a property is sold and the listing goes offline: at that point, many MLSs wipe all the data related to that listing. And all of this plays out in a system that is entirely decentralized, with hundreds of MLSs across the country and plenty of territory overlap, creating fertile ground for further confusion.

With county tax data, many of the challenges stem from structural issues. There are roughly 3,200 counties in the United States, with a ton of variance in data management norms. In California, Lucas explains, “every county does it differently. The code for a single-family home, for instance, might be 100 in one county and 150 in another. This makes it tough to do broader analytics about real estate in a larger region.” 
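One way to cope with this kind of variance is a per-county lookup that translates local codes into a shared vocabulary. The sketch below is only an illustration: the county names, the `COUNTY_CODE_MAPS` table, and the `single_family` label are hypothetical, and only the 100-versus-150 example comes from Lucas's description.

```python
# Hypothetical per-county lookup tables. County names and shared labels are
# made up for illustration; only the idea that one county uses 100 and
# another uses 150 for a single-family home comes from the article.
COUNTY_CODE_MAPS = {
    "county_a": {"100": "single_family"},
    "county_b": {"150": "single_family"},
}


def normalize_use_code(county, raw_code):
    """Translate a county-specific use code into a shared label.

    Returns None when the county or code is unknown, so unmapped values
    can be reviewed instead of silently guessed.
    """
    code_map = COUNTY_CODE_MAPS.get(county, {})
    return code_map.get(str(raw_code).strip())


records = [
    {"county": "county_a", "use_code": 100},
    {"county": "county_b", "use_code": "150"},
    {"county": "county_b", "use_code": 100},  # not in county_b's table
]
labels = [normalize_use_code(r["county"], r["use_code"]) for r in records]
print(labels)  # ['single_family', 'single_family', None]
```

Returning `None` for an unknown code, rather than a default label, keeps unmapped values visible for follow-up.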

Things get even harder when counties change parcel number formats, often leading to entire property histories disappearing. “If you look at a property in Zillow and see a huge gap in its transactional history, it’s probably because someone updated the format and, in doing so, lost the link to the historical records,” says Lucas. 

Ingesting and analyzing data under these circumstances often requires extensive ELT work, which many counties conduct in ways that rely heavily on manual workflows. This approach not only slows things down but also acts as a magnet for human error.

“The way a lot of counties get their data into a database for analysis is by literally sending a photograph of a deed to an offshore vendor, who then types the ownership history, bedroom and bathroom counts, and other values from the photograph into the database by hand,” Lucas explains. “When this happens on a large scale, you start seeing lots of crazy errors. In one case I saw recently, someone had listed a 1,500-square-foot house as having 60 elevators. You don’t have to be an expert to realize that probably isn’t true.” 
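A record like the 60-elevator house can be caught by a basic plausibility check before it reaches an analyst. The following is a minimal sketch, not a method Lucas described: the field names (`square_feet`, `elevators`) and the threshold of one elevator per 1,000 square feet are assumptions chosen for illustration.

```python
def implausible_fields(record, max_elevators_per_1000_sqft=1.0):
    """Return the names of fields whose values look implausible.

    The default threshold is an illustrative assumption, not an industry rule.
    """
    problems = []
    sqft = record.get("square_feet")
    if sqft is None or sqft <= 0:
        # Without a usable square footage, nothing else can be checked.
        problems.append("square_feet")
        return problems
    elevators = record.get("elevators", 0)
    if elevators > max_elevators_per_1000_sqft * sqft / 1000:
        problems.append("elevators")
    return problems


suspect = {"square_feet": 1500, "elevators": 60}
ordinary = {"square_feet": 1500, "elevators": 0}
print(implausible_fields(suspect))   # ['elevators']
print(implausible_fields(ordinary))  # []
```

The function only flags fields for review. Whether a flagged value is actually wrong is still a human call.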

Challenges with reconciling sources

The other big challenge comes when data professionals try to join MLS and county data—that is, combine the two sources to create a more comprehensive, 360-degree view of one or more properties. Joining sources is a helpful way to confirm the accuracy of real estate data and better understand a property’s history and characteristics, all of which can help agents sell.

Your instinct may be to use a property’s parcel number—basically, an ID number associated with a piece of land—as the joining point for reconciling MLS and county data about that property. 

But while parcel numbers are theoretically the same across MLS and county contexts, they’re hardly static. The parcel number format might change if the county switches to a new vendor to manage its data. Certain development activities or changes in plot ownership can result in parcels being combined or dropped from the tax rolls. 

In the end, data scientists who think of parcel numbers as a “single source of truth” will often find those numbers aren’t as reliable as they’d hoped.
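A join on parcel numbers can at least be made tolerant of formatting differences, and it can report which rows failed to match instead of dropping them. The sketch below assumes hypothetical row dictionaries with a `parcel_number` field. It does not handle parcels that were renumbered, combined, or dropped from the tax rolls, which would need a fallback such as address matching.

```python
import re


def normalize_parcel_number(raw):
    """Strip punctuation and whitespace so formatting alone can't block a match."""
    return re.sub(r"[^0-9A-Za-z]", "", str(raw)).upper()


def join_on_parcel(mls_rows, county_rows):
    """Match MLS rows to county rows by normalized parcel number.

    Returns (matched, unmatched) so rows that fail to join stay visible.
    """
    county_by_parcel = {
        normalize_parcel_number(row["parcel_number"]): row for row in county_rows
    }
    matched, unmatched = [], []
    for row in mls_rows:
        key = normalize_parcel_number(row["parcel_number"])
        if key in county_by_parcel:
            # On a field-name clash, the MLS value overrides the county value.
            matched.append({**county_by_parcel[key], **row, "parcel_key": key})
        else:
            unmatched.append(row)
    return matched, unmatched


# Made-up sample rows for illustration.
mls_rows = [
    {"parcel_number": "123-456-789", "bedrooms": 3},
    {"parcel_number": "987 654 321", "bedrooms": 2},
]
county_rows = [{"parcel_number": "123456789", "last_sale_year": 2019}]

matched, unmatched = join_on_parcel(mls_rows, county_rows)
print(len(matched), len(unmatched))  # 1 1
```

Tracking the size of the unmatched list over time is one way to notice when a county has changed its parcel format.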

Expert tips to take control of your data

So what can data professionals do to bring order to the chaos and make unwieldy data less cumbersome to work with? Lucas has a few tips to offer:

  1. Build guardrails
    A lot of data quality work starts with a simple question: Does this data make sense?

    Property values typically don’t increase by a factor of 10 in one year, so if a property was listed in your database at $500,000 last year and this year is listed at $5 million, the most likely explanation is that a decimal point somehow ended up in the wrong place.

    That being said, sometimes the unusual does occur. If, between last year’s data entry and this year’s, a new shopping center was built on the property in question, then the property value may indeed skyrocket. So while it’s good to investigate anomalies, be wary of assuming that a particular value is wrong.

    Instead, try monitoring data drift and flagging changes of unexpected magnitude. From there, you can investigate whether the data is incorrect or just unusual.

  2. Use specialized tools
    Real estate data is complex and sensitive. As we’ve outlined, any number of things can go wrong, and data scientists need technology resources that can reliably catch—or, even better, prevent—as many issues as possible.

    Unfortunately, generic tools like Microsoft Excel aren’t usually up to the task. Lucas described numerous instances of a county sending him an Excel file in which entire columns were full of improperly formatted or just plain inaccurate data, and his only recourse was to contact the county and ask them to fix the problem and resend it.

    That’s why Lucas recommends using customizable tools that can be configured specifically to address the kinds of issues your organization faces. “At one company where I worked, we had entire systems and software engineering teams dedicated to building tools for us to use in our data quality and ELT work,” Lucas says. “A lot of these tools use Python functions, which are really useful for ELT.”

  3. Plan your data quality timeline
    In the data world, there’s a push and pull between two instincts: taking the time to ensure your data is flawless before publishing it, or moving fast now and cleaning up any errors later.

    Each approach has its pros and cons. “When you’re working with consumer-facing data, I find it’s better to do the data quality work up front,” Lucas says. “If we publish bad data and it makes it into a listing, that reflects badly on us.”

    On the other hand, if you’re doing internal tooling or you’re working for a smaller organization, you might choose moving fast over being perfect. As Lucas puts it, “You have to ask whether it’s better to spend a ton of time up front perfecting the data sources or to have something more middle-of-the-road.”
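The guardrail from the first tip, flagging changes of unexpected magnitude without assuming they are wrong, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `flag_large_changes`, the property IDs, and the `max_ratio` threshold of 3 are hypothetical, and the $500,000 to $5 million jump mirrors the example above.

```python
def flag_large_changes(last_year, this_year, max_ratio=3.0):
    """Return (property_id, old_value, new_value) for suspicious value changes.

    A change is flagged when the value grew or shrank by more than
    max_ratio. Flagged rows are candidates for review, not proven errors.
    """
    flagged = []
    for property_id, new_value in this_year.items():
        old_value = last_year.get(property_id)
        if not old_value:
            continue  # no usable baseline (missing or zero) to compare against
        ratio = new_value / old_value
        if ratio > max_ratio or ratio < 1 / max_ratio:
            flagged.append((property_id, old_value, new_value))
    return flagged


# Made-up values for illustration.
last_year = {"p1": 500_000, "p2": 400_000}
this_year = {"p1": 5_000_000, "p2": 420_000, "p3": 300_000}

print(flag_large_changes(last_year, this_year))
# [('p1', 500000, 5000000)]
```

The function returns the old and new values alongside the ID so a reviewer can quickly tell a misplaced decimal point from a genuine jump, such as a new shopping center on the lot.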


Messy data landscapes require vigilance, tenacity, and specialized solutions. Real estate data exemplifies this reality to a tee. But it’s not the only industry that does—so if you found yourself relating to any of the issues Lucas described in this article, check out Great Expectations to ensure your data quality no matter how messy things get.

©2023 Great Expectations. All Rights Reserved.