Just One More Stratification!

Or: How to say “no” as a data person

March 17, 2020


Written by Sam Bail

I’ve been working in “data” for a few years now, with most of my time spent on pretty much all aspects of data engineering in the context of an actual external-facing data product. In a previous role, I was mostly responsible for the “plumbing” and data quality testing, rather than data analysis or data science. However, over the years I also got to work on a few internal data analytics projects that were looking for insights into feature usage of a SaaS product. One of the things that stood out to me about this type of analytical work was the almost insatiable demand of consumers for more data – more analyses, more filtering, more stratifications. Often, these would come in as ad-hoc requests: “Hey, these numbers look great, could you group them by year in addition to month so we can see some trends?”, or, “Maybe if we exclude this kind of population… could you just add that filter real quick?”

Since I had a direct comparison with my experience working on a data engineering team, I noticed a difference in how stakeholders approached data engineering work versus analytical work. It almost seemed as though they considered it somewhat easier to “just run some numbers real quick”, whereas I hardly ever received requests to just “build a new feature real quick”. Here are some of my thoughts as to why that might be the case, and what “data people” can do to avoid getting stuck in an endless loop of “just one more stratification”.

  1. Understand the problem and ask “why” a lot: This is definitely not a problem that’s specific to data work, but a pretty well-known issue in software engineering and other disciplines. Stakeholders often tell you what they want based on what they think will solve their problem, but don’t always tell you the why. When you start digging into the why, you’ll often find that their proposed solution doesn’t really address the problem, or only part of it. Another thing I noticed is that people are often curious about data, but can’t really think of any concrete actions they might take based on what they see. Curiosity and building a mental model of some numbers are great and might be a valid use case, but this frequently ends up having no direct impact on the business goals. While it may be tempting to dive right into the data and give your stakeholders what they ask for, it’s usually worth spending a significant amount of time on understanding their needs and how they fit into higher-level business needs, both immediate and long-term.

  2. Agree on the specs: It’s easy to just brainstorm, get a rough sense of what users want, then go off and pull some data, but this also sets you up for scope creep. Treat data work the same way as a software project: write a clear spec including stratifications and filters, some form of plan based on priorities, and customer acceptance criteria. Another option is to prototype some output using dummy data to make sure your stakeholders have the same idea in mind of what the final deliverable looks like – e.g. will they get a CSV file, an Excel file, a pivot table, an interactive dashboard, some numbers in an email…? Additionally, treat data work as regular work – if you’re working with a ticketing system, make sure to estimate and assign points, stick to a sprint cadence, and/or timebox work. This allows you to treat the project as an actual project with a plan – and if stakeholders ask for work outside of the currently scoped project, you can ask them to re-prioritize other work in favor of the new request.

  3. Empower users: A lot of people in stakeholder positions are incredibly curious when it comes to data. They want to experiment with slicing and dicing data to get a good mental model of what the space looks like. One way to take the load off analysts who might otherwise have to deal with ad-hoc requests is to give your non-technical users an interface to explore the data. It’s important to have data access and literacy in an organization to avoid making engineers bottlenecks, but there are a couple of things to consider here:

    • Any platform, even if it’s “just” internally facing, is a product that needs to be treated as such. As in: it takes time to build, it needs a roadmap, an owner and plan for maintenance/enhancements/bug fixes, a plan for authentication and authorization (who/which team gives and gets permissions to what?), training and onboarding of new users, a plan for availability/uptime… And if you’re thinking “oh hell that sounds like a lot of work”, you’re absolutely right. It’s an investment, and you need to consider the ROI and long-term strategy when embarking on such a project.

    • Data can be misinterpreted. Make sure to understand what users might want to do and manage expectations as to what they can and cannot infer from the data they’re being provided. Any uncontrolled use of data should probably just act as hypothesis generation, with a clear process for establishing “production-ready insights” in collaboration with a data team.

    • Data can be wrong. One risk of self-service data exploration for non-technical users is a potentially less stringent data QA process, due to a lack of tooling – the users have to believe what they see and might not have a way to dig deeper or look into the source data to verify that the output is correct. This can be mitigated by building data validation into data pipelines and ensuring that incoming data is correct, complete, robust against pipeline changes, and meets the expectations of the domain experts.

  4. Due diligence: This is less about managing stakeholders, and more about avoiding extra work for yourself. Even for small ad-hoc requests and pulling some quick numbers, I believe code reviews or just sanity checks (looking-over-the-shoulder style “reviews”) are important – first and foremost to ensure what you deliver is correct, of course, but also to avoid having to go back and redo work because of a bug. This also applies to understanding and cross-checking whether what you’re doing actually matches the specs.
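To make points 2 to 4 a little more concrete, here is a minimal sketch of what "spec first, validate, then run" could look like in plain Python. Everything in it is made up for illustration: the dummy rows, the column names, the `AnalysisSpec` class and the `validate` and `run` helpers are assumptions, not part of any particular library or of the projects described above. The idea is simply that the agreed stratifications and filter live in one place, the input gets checked before anyone sees numbers, and the prototype runs on dummy data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable

# Hypothetical dummy rows standing in for real feature-usage data.
DUMMY_ROWS = [
    {"user_id": 1, "plan": "free", "month": "2020-01", "feature": "export"},
    {"user_id": 2, "plan": "paid", "month": "2020-01", "feature": "export"},
    {"user_id": 3, "plan": "paid", "month": "2020-02", "feature": "search"},
    {"user_id": 4, "plan": "paid", "month": "2020-02", "feature": "export"},
]
REQUIRED_COLUMNS = ["user_id", "plan", "month", "feature"]
ALLOWED_PLANS = {"free", "paid"}


@dataclass(frozen=True)
class AnalysisSpec:
    """The agreed scope: which columns to group by and which rows to keep."""
    group_by: tuple
    row_filter: Callable[[dict], bool]


def validate(rows, required_columns, allowed_plans):
    """Return a list of readable problems; an empty list means the data passed."""
    problems = []
    for i, row in enumerate(rows):
        missing = [c for c in required_columns if row.get(c) is None]
        if missing:
            problems.append(f"row {i}: missing {missing}")
        elif row.get("plan") not in allowed_plans:
            problems.append(f"row {i}: unexpected plan {row.get('plan')!r}")
    return problems


def run(spec, rows):
    """Count rows per group, using only the filter and stratifications in the spec."""
    counts = Counter(
        tuple(row[col] for col in spec.group_by)
        for row in rows
        if spec.row_filter(row)
    )
    return dict(counts)


# The spec agreed with stakeholders: paid users only, grouped by month and feature.
spec = AnalysisSpec(
    group_by=("month", "feature"),
    row_filter=lambda row: row["plan"] == "paid",
)

problems = validate(DUMMY_ROWS, REQUIRED_COLUMNS, ALLOWED_PLANS)
result = run(spec, DUMMY_ROWS) if not problems else {}
print(problems)
print(result)
```

A request for "just one more stratification" then becomes a visible change to `group_by` that can be reviewed and prioritized, rather than a quick edit buried in a query. In a real pipeline, the hand-rolled `validate` function would probably be replaced by a dedicated data validation tool.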

These are really just a few suggestions based on my experience working in a data role, and while I obviously can’t claim they’re typical, I’m hoping that the above points resonate with you, fellow data person. Though it can be difficult at times to create more friction up front with stakeholders, these steps may prove to be a worthwhile investment of time, both for them and for your sanity.

We’d love to hear your thoughts on this article and find out more about the kinds of typical requests you receive and how you handle them - feel free to tweet me @spbail #onemorestratification!

Greetings! Have any questions about using Great Expectations? Join us on Slack
The Great Expectations Team
