As data grows in complexity, analytics teams build increasingly complex pipelines. The outputs of those pipelines power extensive dashboards for marketing, sales, you name it.
Complicated, intertwined pipelines aren’t a bad thing. With more transparency into the data, organizations have the tools they need to make truly data-driven decisions.
Scope creep: expected and normal
How do pipelines actually become so complex?
Let’s start from the beginning: a stakeholder wants to see something in a dashboard. So, the analyst writes the query, works with data engineering to have it run on a schedule, and builds the visualizations. All set, yes? Well, not so fast.
After looking at the resulting dashboard, the stakeholder realizes they forgot to mention one key requirement. Say the stakeholder is the VP of Marketing: they want to filter the Google ad reports by the type of ad shown (display ad, search ad, etc.).
This request would be easy if ad type were a field that already existed in the marketing dataset. If it isn’t, the analyst must track down whoever can update the ETL job to include the new piece of information (for illustration purposes, say the organization hasn’t adopted a tool like Fivetran yet).
After that’s done, the query is updated and so are the dashboards.
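As a minimal sketch, the updated query might look something like the snippet below once the ETL job lands the new column. The table and column names (google_ads_daily, ad_type) are assumptions for illustration, and sqlite3 simply stands in for whatever warehouse connection the team actually uses.

```python
import sqlite3  # stand-in for the real warehouse connection

conn = sqlite3.connect("warehouse.db")

# The analyst's updated query: ad_type is the column the updated
# ETL job now lands, and the VP's requested filter is a parameter.
query = """
SELECT
    campaign_id,
    ad_type,
    SUM(impressions) AS impressions,
    SUM(clicks)      AS clicks
FROM google_ads_daily
WHERE ad_type = :ad_type
GROUP BY campaign_id, ad_type
"""

rows = conn.execute(query, {"ad_type": "display"}).fetchall()
```

Parameterizing the filter means the next “can you also show search ads?” request is a dashboard control, not another pipeline change.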
In software engineering, features that aren’t in the original request but somehow creep in are considered scope creep. In that world, widening the scope isn’t great.
For data professionals, scope creep is normal.
The back-and-forth between analysts and stakeholders is exactly the kind of fluid communication that data-driven organizations have. If this communication didn’t exist, the VP of Marketing simply wouldn’t find the dashboard useful and would try to piece the information together from elsewhere. Doesn’t that jeopardize the warehouse’s role as the single source of truth?
When the pipelines work, all the dashboards look great. Inevitably, that won’t always be the case. What does a failure look like in the world of growing pipeline scope?
Remediating dashboards: a process
Say one day the same VP of Marketing comes to you and says, “My dashboard is broken, can you fix it?” The last time you checked, it was perfectly fine. The natural place to start the root cause analysis is in the upstream processes. If the links between datasets are thoroughly mapped out, it’s not unusual to discover that 30% of the whole pipeline is implicated as a possible cause of the problem.
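To make “thoroughly mapped out” concrete, here’s a minimal sketch of walking a lineage graph to find every dataset upstream of a broken dashboard. The graph itself is a made-up example; in practice the links would come from a lineage tool or from parsing the queries themselves.

```python
# Hypothetical lineage graph: each dataset maps to the datasets it reads from.
lineage = {
    "marketing_dashboard": ["ad_performance"],
    "ad_performance": ["google_ads_daily", "campaign_costs"],
    "campaign_costs": ["finance_exports"],
    "google_ads_daily": [],  # raw ETL output
    "finance_exports": [],   # raw ETL output
}

def upstream(dataset: str, graph: dict[str, list[str]]) -> set[str]:
    """Return every dataset upstream of the given one."""
    found: set[str] = set()
    stack = [dataset]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

# Every one of these is a possible root cause of the broken dashboard.
print(upstream("marketing_dashboard", lineage))
```

Even in this toy graph, four of the five datasets are implicated; real pipelines fan out much wider.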
Without any existing checks in place, triaging the issue could take several hours of an analyst’s time. Furthermore, imagine a team boundary cutting straight through the middle of the pipeline. More often than not, the issue is a breakage in one of the links that crosses that boundary, and coordinating across teams turns an estimate of hours of digging into days of back-and-forth.
With our marketing example, there’s the ETL that brings third-party ad data into the warehouse. The raw data coming out of those platforms isn’t something anyone at the organization controls. And the analyst’s query likely joined other related datasets, themselves generated by other queries. To properly find the root cause, every one of these datasets, and every handoff point between them, must be investigated.
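Even lightweight checks at each handoff point narrow that search dramatically. Below is a minimal sketch assuming pandas and a made-up load_table helper; it only asserts that the new column actually landed and that the transformation didn’t silently drop campaigns. (The talks credited at the end of this post show how a framework like Great Expectations formalizes exactly these kinds of assertions.)

```python
import pandas as pd

def load_table(name: str) -> pd.DataFrame:
    """Hypothetical helper: read a warehouse table into a DataFrame."""
    return pd.read_parquet(f"warehouse/{name}.parquet")

raw = load_table("google_ads_daily")        # the ETL's output
transformed = load_table("ad_performance")  # the analyst's query output

# Handoff check 1: the column the VP asked for actually landed.
assert "ad_type" in raw.columns, "ETL job never added ad_type"

# Handoff check 2: the downstream query didn't silently drop campaigns.
raw_campaigns = raw["campaign_id"].nunique()
out_campaigns = transformed["campaign_id"].nunique()
assert out_campaigns >= 0.95 * raw_campaigns, (
    f"ad_performance lost campaigns: {out_campaigns} of {raw_campaigns}"
)
```

Run on a schedule alongside the pipeline, checks like these turn “the dashboard looks broken” into “the breakage is at this handoff” before the VP ever notices.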
We can all agree this isn’t the best use of an analyst’s day.
Why should root cause analyses be made easier?
The way we build isn’t about making our lives easier when everything is working (sometimes, but that’s usually not the focus). It’s generally about making issues easier to fix, read: debugging.
Just like in software engineering, debugging is an important consideration in analytics. Whether it’s debugging broken code or data that doesn’t logically make sense, maintaining a pipeline includes fixing issues as they arise.
Because yes, issues will arise.
We just agreed an analyst shouldn’t spend an entire day blindly hunting down an issue in a pipeline they may not have even built. But shouldn't an analyst’s job be to maintain dashboards, if their team built them?
A construction worker’s job could be to build a house, but they wouldn’t be very keen on doing the job using just a hammer. Just like a construction worker can get a lot more done with power tools, a data analyst becomes much more productive when given the tools to work efficiently.
What does being an efficient analytics team really mean? Continue to Part 2! Maximizing Productivity of Analytics Teams Part 2: Complexity Carrying Capacity
Recap
- Scope creep is normal.
- Pipelines use data from multiple teams, making debugging harder.
- We make architectural decisions to make debugging easier, giving data analysts tools to be more productive.
This blog was inspired by the talks Strata 2018: Pipeline Testing with Great Expectations and Testing and Documenting Your Data Doesn't Have to Suck.
By Sarah Krasnik
Thanks for reading! I love talking data stacks. Shoot me a message.
Data Engineer
Data Blogger
Data Advocate